1 Introduction 


Simplicial complexes are widely used in combinatorial and computational topology, and have found 
many applications in topological data analysis and geometric inference. The most common
representation uses the Hasse diagram of the complex, which has one node per simplex and an edge 
between any pair of incident simplices whose dimensions differ by one. A few attempts to obtain 
more compact representations have been reported recently. 

Attali et al. [ALS12] proposed the skeleton-blockers data structure, which represents a simplicial 
complex by its 1-skeleton together with its set of blockers. Blockers are the simplices which are 
not contained in the complex but whose proper subfaces are. Flag complexes have no blockers 
and the skeleton-blocker representation is especially efficient for complexes that are “close” to flag 
complexes. An interesting property of the skeleton-blocker representation is that it enables efficient 
edge contraction. 

Boissonnat and Maria [BM14] have proposed a tree representation called the Simplex Tree 
that can represent general simplicial complexes and scales well with dimension. The nodes of 
the tree are in bijection with the simplices (of all dimensions) of the simplicial complex. In this 
way, the Simplex Tree explicitly stores all the simplices of the complex but it does not represent 
explicitly all the incidences between simplices that are stored in the Hasse diagram. Storing all the 
simplices is useful (for example, one can then attach information to each simplex or store a filtration 
efficiently). Moreover, the tree structure of the Simplex Tree leads to efficient implementation of the 
basic operations on simplicial complexes (such as retrieving incidence relations, and in particular 
retrieving the faces or the cofaces of a simplex). 

In this paper, we propose a way to compress the Simplex Tree so as to store as few nodes and 
edges as possible without compromising the functionality of the data structure. The new compressed 
data structure is in fact a finite automaton (referred to in this paper as the Minimal Simplex 
Automaton) and we describe an optimal algorithm for its construction. Previous works have looked 
at trie compression and have tried to establish a good trade-off between speed and size, but in most 
of the works, the emphasis is on one of the two. Two examples of work where the speed is of main 
concern are [AN93] where the query time is improved by reducing the number of levels in a binary 
trie (which corresponds to truncating the Simplex Tree at a certain height) and [AZS99] where 
trie data structures are optimized for computer memory architectures. Other popular compact 
representations for tries in connection with predictive text compression are discussed in [TR86], 
but they only include (all) substrings of constant length that exist in a text and also do not focus 
on supporting efficient access in the compressed trie. Therefore, such representations are not useful 
here, due to the loss of significant information. 

When the size of the structure is of primary concern, the focus is usually on automata
compression. For instance, in the context of natural language data processing, significant savings in 
memory space can be obtained if the dictionary is stored in a directed acyclic word graph (DAWG), 
a form of a minimal deterministic automaton, where common suffixes are shared [AJ88]. However, 
a theoretical analysis of compression is seldom, if ever, carried out in these works. In this paper, we 
analyze the size of the Minimal Simplex Automaton and also demonstrate (through experiments) 
that compression works especially well for the Simplex Tree due to the structure of simplicial complexes: 
namely, that all subfaces of a simplex in the complex also belong to the complex. Additionally, 
we consider the influence of the labeling of the vertices on compression, which can be significant. 
Further, we show that it is hard to find an optimal labeling for the compressed Simplex Tree and 
for the Minimal Simplex Automaton. 

We introduce two new data structures for simplicial complexes called the Maximal Simplex 
Tree (MxST) and the Simplex Array List (SAL). MxST is a subtree of the Simplex Tree whose 
leaves are in bijection with the maximal simplices (i.e., simplices with no cofaces) of the complex. 
We show that this data structure is compact and that it allows efficient operations. MxST is 
augmented to obtain SAL where every node uniquely represents an edge. A nice feature of SAL is 
its invariance over labeling of vertices. We show that SAL supports efficient basic operations and 
that it is compact when the dimension of the simplicial complex is fixed, a case of great interest in 
Manifold Learning and Topological Data Analysis. 


2 Simplicial Complex: Definitions and a Lower Bound 


A simplicial complex $K$ is defined over a (finite) vertex set $V$ whose elements are called the vertices 
of $K$ and is a set of non-empty subsets of $V$ that is required to satisfy the following two conditions: 

1. $p \in V \Rightarrow \{p\} \in K$ 

2. $\sigma \in K,\ \tau \subseteq \sigma \Rightarrow \tau \in K$ 

Each element $\sigma \in K$ is called a simplex or a face of $K$ and, if $\sigma \in K$ has precisely $s + 1$ 
elements ($s \geq -1$), $\sigma$ is called an $s$-simplex and the dimension of $\sigma$ is $s$. The dimension of the 
simplicial complex $K$ is the largest $d$ such that it contains a $d$-simplex. 

A face of a simplex $\sigma = \{p_0, \ldots, p_s\}$ is a simplex whose vertices form a subset of $\{p_0, \ldots, p_s\}$. A 
proper face is a face different from $\sigma$ and the facets of $\sigma$ are its proper faces of maximal dimension. 
A simplex $\tau \in K$ admitting $\sigma$ as a face is called a coface of $\sigma$. 

In this paper, the class of $d$-dimensional simplicial complexes on $n$ vertices with $m$ simplices, 
of which $k$ are maximal, is denoted by $\mathcal{K}(n,k,d,m)$, and $K$ denotes a simplicial complex in 
$\mathcal{K}(n,k,d,m)$. At times, we say $K_\theta \in \mathcal{K}_\theta(n,k,d,m)$ (where $\theta : V \to \{1, 2, \ldots, |V|\}$ is a labeling 
of the vertex set $V$ of $K$) when we want to emphasize that some of the data structures seen in this 
paper are influenced by the labeling of the vertices. 

A maximal simplex of a simplicial complex is a simplex which is not contained in a larger 
simplex of the complex. A simplicial complex is pure, if all its maximal simplices are of the same 
dimension. Also, a free pair is defined as a pair of simplices $(\tau, \sigma)$ in $K$ where $\tau$ is the only coface of 
$\sigma$. In Figure 1, we have a simplicial complex on vertex set {1, 2, 3, 4, 5, 6} which has three maximal 
simplices: the two tetrahedra 1-3-4-5 and 2-3-4-5, and the triangle 1-3-6. We use this complex 
as an example throughout the paper. 
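
As a small illustration of these definitions, the following Python sketch enumerates all non-empty faces of the running example from its three maximal simplices and checks the closure condition. The function names (`simplices_from_maximal`, `is_closed`) are hypothetical helpers introduced here for illustration only; simplices are represented as sorted tuples of labels.

```python
from itertools import combinations

def simplices_from_maximal(maximal):
    """Return every non-empty face of the given maximal simplices (sorted tuples)."""
    faces = set()
    for sigma in maximal:
        verts = sorted(sigma)
        for size in range(1, len(verts) + 1):
            faces.update(combinations(verts, size))
    return faces

def is_closed(simplices):
    """Check condition 2: every non-empty proper subset of a simplex is a simplex."""
    return all(
        tau in simplices
        for sigma in simplices
        for size in range(1, len(sigma))
        for tau in combinations(sigma, size)
    )

# Running example: two tetrahedra and one triangle (assumed labeling of Figure 1).
MAXIMAL = [(1, 3, 4, 5), (2, 3, 4, 5), (1, 3, 6)]
K = simplices_from_maximal(MAXIMAL)
dimension = max(len(s) for s in K) - 1
```

Under this representation the example has 6 vertices, 11 edges, 8 triangles and 2 tetrahedra, that is, 27 non-empty simplices (28 when the empty face is counted).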

Figure 1: Simplicial complex with the tetrahedra 1-3-4-5 and 2-3-4-5, and the triangle 1-3-6. 

The flag complex of an undirected graph $G$ is defined as an abstract simplicial complex whose 
simplices are the sets of vertices in the cliques of $G$. Let $(P, d)$ be a metric space where $P$ is a 
discrete point set. Given a positive real number $r > 0$, the Rips complex is the abstract simplicial 
complex $\mathcal{R}^r(P)$ where a simplex $\sigma \in \mathcal{R}^r(P)$ if and only if $d(p, q) < r$ for every pair of vertices of 
$\sigma$. Note that the Rips complex is a special case of a flag complex. This completes the definition of 
the complexes which will be used in this paper. 
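
The Rips construction can be sketched by brute force for small point sets. The Python sketch below is only an illustration: the function name `rips_complex` and the `max_dim` cutoff are assumptions introduced here, points are assumed to lie in Euclidean space, and the strict threshold `< r` follows the definition as stated above (other conventions exist).

```python
from itertools import combinations
from math import dist

def rips_complex(points, r, max_dim):
    """Simplices (sorted index tuples) of the Rips complex up to dimension max_dim.

    A set of points spans a simplex iff all pairwise distances are < r,
    i.e. iff the points form a clique in the r-neighborhood graph.
    """
    n = len(points)
    close = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if dist(points[i], points[j]) < r
    }
    simplices = [(i,) for i in range(n)]
    for size in range(2, max_dim + 2):
        for cand in combinations(range(n), size):
            if all(pair in close for pair in combinations(cand, 2)):
                simplices.append(cand)
    return simplices

# Three mutually close points and one far-away point (hypothetical data).
example = rips_complex([(0, 0), (1, 0), (0, 1), (5, 5)], r=1.5, max_dim=2)
```

Because membership of a simplex depends only on its edges, the result is the flag complex of the neighborhood graph, which is the sense in which the Rips complex is a special case of a flag complex.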

We would like to note here that the case when $k = O(n)$ is of particular interest. It can be 
observed in flag complexes constructed from planar graphs and expanders [ELS10], and in general, 
from nowhere dense graphs [GKS13], and also from chordal graphs [G80]. Generalizing, for all flag 
complexes constructed from graphs with degeneracy $O(\log n)$ (degeneracy is the smallest integer $r$ 
such that every subgraph has a vertex of degree at most $r$), we have that $k = n^{O(1)}$ [ELS10]. This 
encompasses a large class of complexes encountered in practice. 

Now, we obtain a lower bound on the space needed to represent simplicial complexes by 
presenting a counting argument on the number of distinct simplicial complexes. 

Theorem 1. Consider the class of all simplicial complexes on $n$ vertices of dimension $d$, containing 
$k$ maximal simplices, where $d \geq 2$ and $k \geq n + 1$, and consider any data structure that can represent 
the simplicial complexes of this class. Such a data structure requires $\log \binom{\binom{n/2}{d+1}}{k-n}$ bits to be stored. 
For any constant $\varepsilon \in (0,1)$ and for $\frac{2}{\varepsilon} n \leq k \leq n^{(1-\varepsilon)d}$ and $d \leq n^{\varepsilon/3}$, the bound becomes $\Omega(kd \log n)$. 


Proof. The proof of the first statement is by contradiction. Let us define $h = k - n \geq 1$ and suppose 
that there exists a data structure that can be stored using only $s < \log \alpha = \log \binom{\binom{n/2}{d+1}}{h}$ bits. We 
will construct $\alpha$ simplicial complexes, all with the same set $P$ of $n$ vertices, the same dimension 
$d$, and with exactly $k$ maximal simplices. By the pigeonhole principle, two different simplicial 
complexes, say $K$ and $K'$, are encoded by the same word. So any algorithm will give the same 
answer for $K$ and $K'$. But, by the construction of these complexes, there is a simplex which is in 
$K$ and not in $K'$. This leads to a contradiction. 

The simplicial complexes are constructed as follows. Let $P' \subset P$ be a subset of cardinality 
$n/2$, and consider the set of all possible simplicial complexes of dimension $d$ with vertices in $P'$ 
that contain $h$ maximal simplices. We further assume that all maximal simplices have dimension 
$d$ exactly. These complexes are $\alpha = \binom{\binom{n/2}{d+1}}{h}$ in number, since the total number of maximal $d$-dimensional 
simplices is $\binom{n/2}{d+1}$ and we choose $h$ of them. Let us call them $\Gamma_1, \ldots, \Gamma_\alpha$. We now 
extend each $\Gamma_i$ so as to obtain a simplicial complex whose vertex set is $P$ and has exactly $k$ 
maximal simplices. The maximal simplices will consist of the $h$ maximal simplices of dimension $d$ 
already constructed plus a number of maximal simplices of dimension 1. The set of vertices of $\Gamma_i$, 
$\mathrm{vert}(\Gamma_i)$, may be a strict subset of $P'$. Let its cardinality be $\frac{n}{2} - r_i$, and observe that $0 \leq r_i < \frac{n}{2}$. 
Consider now the complete graph on the $\frac{n}{2} + r_i$ vertices of $P \setminus \mathrm{vert}(\Gamma_i)$. Any spanning tree of 
this graph gives $\frac{n}{2} + r_i - 1$ edges and we arbitrarily choose $\frac{n}{2} - r_i + 1$ edges from the remaining 
edges of the graph to obtain $n$ distinct edges spanning over the vertices of $P \setminus \mathrm{vert}(\Gamma_i)$. We have 
thus constructed a 1-dimensional simplicial complex $K_i$ on the $\frac{n}{2} + r_i$ vertices of $P \setminus \mathrm{vert}(\Gamma_i)$ with 
exactly $n$ maximal simplices. Finally, we define the complex $\Lambda_i = \Gamma_i \cup K_i$ that has $P$ as its vertex 
set, dimension $d$, and $k$ maximal simplices. The set of $\Lambda_i$, $i = 1, \ldots, \alpha$, is the set of simplicial 
complexes we were looking for. 

The second statement in the theorem is proved through the following computation: 


\begin{align*}
\log \binom{\binom{n/2}{d+1}}{k-n}
&\geq \log \left[ \frac{\left(\frac{n}{2(d+1)}\right)^{(d+1)(k-n)}}{(k-n)^{k-n}} \right] \\
&= (d+1)(k-n)\log n - (d+1)(k-n) - (d+1)(k-n)\log(d+1) - (k-n)\log(k-n) \\
&\geq (d+1)(k-n)\log n - 3(d+1)(k-n) - (d+1)(k-n)\log d - (k-n)\log k \\
&\geq (d+1)(k-n)(\log n - 3 - \log d) - (k-n)(1-\varepsilon)\, d \log n \\
&\geq d\varepsilon(k-n)\log n + (k-n)\log n - (d+1)(k-n)\left(3 + \frac{\varepsilon}{3}\log n\right) \\
&\geq \frac{2\varepsilon}{3}\left(1 - \frac{\varepsilon}{2}\right) kd \log n + (k-n)\log n - 3d(k-n) - (k-n)\left(3 + \frac{\varepsilon}{3}\log n\right) \\
&= \Omega(kd \log n)
\end{align*}


We note that in the above computation, the first inequality is obtained by applying the following 
bound on binomial coefficients: $\binom{n}{k} \geq \left(\frac{n}{k}\right)^k$. 
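
To get a feel for the magnitude of this bound, the following Python sketch evaluates the exact counting bound and the relaxed closed form obtained after applying the binomial inequality twice. The function names are hypothetical, logarithms are taken base 2, and the sketch assumes that $n$ is even and that $k - n$ does not exceed $\binom{n/2}{d+1}$.

```python
from math import comb, log2

def lower_bound_bits(n, k, d):
    """Exact bound: log2 of the number of complexes built in the proof."""
    h = k - n
    return log2(comb(comb(n // 2, d + 1), h))

def relaxed_bound_bits(n, k, d):
    """Weaker closed form from binom(a, b) >= (a / b)**b applied twice."""
    h = k - n
    return h * ((d + 1) * log2(n / (2 * (d + 1))) - log2(h))

# Hypothetical parameters chosen only for illustration.
exact = lower_bound_bits(40, 60, 3)
relaxed = relaxed_bound_bits(40, 60, 3)
```

The relaxed value is never larger than the exact one, which mirrors the first inequality of the computation.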


We can adapt the above proof to build $n$ maximal simplices on $P \setminus \mathrm{vert}(\Gamma_i)$, each of dimension 
$d$, to ensure the lower bound applies also to pure simplicial complexes (here $\Gamma_i$ is one of the complexes on $P'$ built in the proof). This is done by first building 
maximal simplices of dimension $d$ on vertices of $P \setminus \mathrm{vert}(\Gamma_i)$ and then building 
one maximal simplex which contains all the remaining vertices. We would complete the construction 
of $k$ maximal simplices in the complex by choosing the remaining maximal simplices of 
dimension $d$ from vertices of $P \setminus \mathrm{vert}(\Gamma_i)$. 

Theorem 1 applies particularly to the case of pseudomanifolds of fixed dimension where we have 
k < n 2 (i.e., £ = ^ suffices) [BB97]. The case where d is small is important in Manifold Learning 
where it is usually assumed that the data live close to a manifold of small intrinsic dimension. The 
dimension of the simplicial complex should reflect this fact and ideally be equal to the dimension 
of the manifold. 
3 Compression of the Simplex Tree 


Let $K \in \mathcal{K}(n, k, d, m)$ be a simplicial complex whose vertices are labeled from 1 to $n$ and ordered 
accordingly. We can thus associate to each simplex of $K$ a word on the alphabet set $\{1, \ldots, n\}$. 
Specifically, a $j$-simplex of $K$ is uniquely represented as the word of length $j + 1$ consisting of 
the ordered set of the labels of its $j + 1$ vertices. Formally, let $\sigma = \{v_{\ell_0}, \ldots, v_{\ell_j}\}$ be a simplex, 
where $v_{\ell_i}$ are vertices of $K$, $\ell_i \in \{1, \ldots, n\}$ and $\ell_0 < \cdots < \ell_j$. Then $\sigma$ is represented by the word 
$[\sigma] = [\ell_0, \ldots, \ell_j]$. The last label of the word representation of a simplex $\sigma$ will be called the last 
label of $\sigma$ and denoted by last($\sigma$). The simplicial complex $K$ can be defined as a collection of 
words on an alphabet of size $n$. To compactly represent the set of simplices of $K$, we store the 
corresponding words in a tree satisfying the following properties: 

1. The nodes of the tree are in bijection with the simplices (of all dimensions) of the complex. 
The root is associated to the empty face. 

2. Each node of the tree, except the root, stores the label of a vertex. Specifically, the node N 
associated to a simplex $\sigma \neq \emptyset$ stores the label of the vertex last($\sigma$). 

3. The vertices whose labels are encountered along a path from the root to a node N associated 
to a simplex $\sigma$ are the vertices of $\sigma$. The labels are sorted by increasing order along such a 
path, and each label appears exactly once. 

This data structure is called the Simplex Tree of K [BM14] and denoted by ST(K) or simply 
ST when there is no ambiguity. It may be seen as a trie [BS97] on the words representing the 
simplices of the complex. The depth of the root is 0 and the depth of a node is equal to the 
dimension of the simplex it represents plus one. Also, in this paper we assume that ST is directed 
from the root to the leaves. 

We give a constructive definition of ST. Starting from an empty tree, insert the words representing
the simplices of the complex in the following manner. When inserting the word $[\sigma] = [\ell_0, \ldots, \ell_j]$, 
start from the root, and follow the path containing successively all labels $\ell_0, \ldots, \ell_i$, where $[\ell_0, \ldots, \ell_i]$ 
denotes the longest prefix of $[\sigma]$ already stored in the Simplex Tree. Next, append to the node 
representing $[\ell_0, \ldots, \ell_i]$ a path consisting of the nodes storing labels $\ell_{i+1}, \ldots, \ell_j$. In Figure 2, we 
give ST for the simplicial complex shown in Figure 1. 
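
The constructive definition above can be sketched as a plain trie in Python. This is a minimal illustration under stated assumptions, not the implementation of [BM14]: the class `Node` and the functions `build_simplex_tree`, `count_nodes` and `contains` are hypothetical names, children are kept in a dictionary keyed by vertex label (so the label of a node is the key under which its parent stores it), and the tree is built by inserting every face of every maximal simplex.

```python
from itertools import combinations

class Node:
    """One node of the trie; children maps a vertex label to the child node."""
    def __init__(self):
        self.children = {}

def build_simplex_tree(maximal):
    """Insert the word of every non-empty face of every maximal simplex."""
    root = Node()  # the root stands for the empty face
    for sigma in maximal:
        verts = sorted(sigma)
        for size in range(1, len(verts) + 1):
            for word in combinations(verts, size):
                node = root
                for label in word:
                    # Follow the longest stored prefix, then append new nodes.
                    node = node.children.setdefault(label, Node())
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())

def contains(root, simplex):
    """Membership query: follow the sorted labels from the root."""
    node = root
    for label in sorted(simplex):
        node = node.children.get(label)
        if node is None:
            return False
    return True

# Running example of Figure 1.
tree = build_simplex_tree([(1, 3, 4, 5), (2, 3, 4, 5), (1, 3, 6)])
```

For the running example the trie has 28 nodes, one per simplex including the empty face.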

If $K$ consists of $m$ simplices (including the empty face), the associated ST contains exactly $m$ 
nodes. Thus, we need $O(m \log n)$ space/bits to represent ST (since each node stores a vertex which 
needs $O(\log n)$ bits to be represented). We can compare this to the lower bound of Theorem 1. In 
particular, if $k = O(1)$ then ST requires at least $\Omega(2^d \log n)$ bits, whereas Theorem 1 proves the 
necessity of only $\Omega(d \log n)$ bits. Therefore, while the Simplex Tree is an efficient data structure for 
some basic operations such as determining membership of a simplex and computing the $r$-skeleton 
of the complex, it requires storing every simplex explicitly through a node, leading to combinatorial 
redundancy. To overcome this, we introduce a compression technique for the ST. 






Figure 2: Simplex Tree of the simplicial complex in Figure 1. 


3.1 Compressed Simplex Tree 

Consider the ST in Figure 3 and note that the red shaded region appears twice. The goal of the 
compression is to identify these common parts and store them only once. More concretely, if the 
same subtree is rooted at two different nodes in ST, then the subtree is stored only once and the 
two root nodes now point to the unique copy of the subtree. As a consequence, the nodes are no 
longer in bijection with the simplices of the complex (as was the case for ST), but we still have 
the property that the paths from the root are in bijection with the simplices. We see in Figure 4 
the compressed ST of the simplicial complex described in Figure 1. In the rest of the paper, we 
denote this compression operation by C. Also, unless otherwise stated, |ST| and |C(ST)| refer 
to the number of edges in ST and C(ST) respectively. 
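
One way to realise this sharing of identical subtrees is to give every node a signature made of its label and the signatures of its children, and to keep one copy per distinct signature. The Python sketch below illustrates that idea on the running example; it is a sketch under stated assumptions, not the algorithm of the paper (which is described through automata minimization), and the names `build_simplex_tree`, `count_edges`, `compress` and `compressed_edges` are hypothetical.

```python
from itertools import combinations

def build_simplex_tree(maximal):
    """Nested-dict trie: each key is a vertex label, each value the subtree below it."""
    root = {}
    for sigma in maximal:
        verts = sorted(sigma)
        for size in range(1, len(verts) + 1):
            for word in combinations(verts, size):
                node = root
                for label in word:
                    node = node.setdefault(label, {})
    return root

def count_edges(tree):
    """Number of edges of the uncompressed tree, i.e. |ST|."""
    return sum(1 + count_edges(sub) for sub in tree.values())

def compress(tree):
    """Share identical labelled subtrees.

    Returns a table mapping node id -> (label, tuple of child ids);
    two nodes receive the same id iff they carry the same label and
    have identical subtrees below them.
    """
    ids = {}
    table = {}

    def visit(label, sub):
        children = tuple(visit(l, s) for l, s in sorted(sub.items()))
        signature = (label, children)
        if signature not in ids:
            ids[signature] = len(ids)
            table[ids[signature]] = signature
        return ids[signature]

    visit(None, tree)  # the root carries no label
    return table

def compressed_edges(table):
    """Number of edges of the compressed structure, i.e. |C(ST)|."""
    return sum(len(children) for _, children in table.values())

st = build_simplex_tree([(1, 3, 4, 5), (2, 3, 4, 5), (1, 3, 6)])
cst = compress(st)
```

On the running example this sketch goes from 27 edges in ST to 19 edges and 8 nodes after sharing; in particular the six nodes labeled 4, each followed by a single child 5, collapse into one.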



Answering simplex membership queries and other queries that only require traversing ST from 
root to leaves can be implemented in C(ST) exactly as in ST [BM14]. Allowing upward traversal 
in ST is also possible (with additional pointers from children to parents), and this has been shown 
to improve the efficiency of some operations, such as face or coface retrieval. However, in C(ST), 
parents are not unique. To account for this, we mark the parents that were accessed, and use this to 
go back in the upward direction. This implies an additional storage of $O(d \log n)$ while traversing, 
but a node (simplex) having many parents can help locate cofaces much faster. 
Figure 4: Compressed Simplex Tree of the Simplex Tree given in Figure 2. The extent of compression
is demonstrated by the following: the edge 4-5, which appears six times in the Simplex Tree 
of Figure 2, appears only once in the Compressed Simplex Tree. 


Next, we will introduce an automaton perspective of the above compression and show how 
to deduce the optimal compression algorithm for ST. We will also describe insertion and removal 
operations on C(ST) through the automaton perspective. 

3.2 Minimal Simplex Automaton 

A Deterministic Finite state Automaton (DFA) recognizing a language is defined by a set of states 
and labeled transitions between these states to detect if a given word is in a predefined language 
or not. ST can be seen as a DFA: let us define the set of m states by V = {nodes of ST}. A 
transition from state u to state v is labeled by a if and only if there is in ST an edge from u to v, 
and v contains the vertex a. We define the Simplex Automaton of K (denoted by SA(K)) as the 
automaton described above (cf. Figure 5). 



Figure 5: Simplex Automaton of the simplicial complex in Figure 1. 











SA is basically the same data structure as ST except that the labels are not put on the nodes 
but on the edges entering these nodes and thus, basic operations in SA can be implemented as in 
ST. Also, by construction of SA, it is obvious that the number of states and transitions of SA are 
equal to the number of nodes and edges in ST respectively. 

It is known [N58] that if a language L is regular (accepted by a DFA), then L has a unique 
minimal automaton. DFA minimization is the task of transforming a given DFA into an equivalent 
DFA that has a minimum number of states. We represent the action of performing DFA minimization
by M. For any $K \in \mathcal{K}_\theta(n, k, d, m)$, let us define the Minimal Simplex Automaton (M(SA)) as 
the minimal deterministic automaton which recognizes the language $L_{K_\theta}$. Compressing ST can be 
seen as DFA minimization since merging identical subtrees corresponds to merging indistinguishable
states in the automaton. It is possible to get C(ST) from M(SA) by duplicating the states 
such that for each node, the labels of all its incoming edges are the same, and then by moving the 
labels from the edges to the next node. Also, it should be observed that the number of edges in 
C(ST) and the number of transitions in M(SA) may not be the same. The reason is that states in 
M(SA) having identical sets of outgoing paths can merge even when the incoming sets of transitions 
are different for each of these states, while such nodes in C(ST) would not have merged. 

Algorithmic aspects of DFA minimization have been well studied. For instance, Hopcroft's 
algorithm [H71] minimizes an automaton with $m$ transitions over an alphabet of size $n$ in 
$O(m \log m \log n)$ steps and needs at most $O(m \log n)$ space. This running time is shown in [H71] to 
be optimal over the set of regular languages. Additionally, Revuz showed that acyclic automata 
(which SA indeed is) can be minimized in linear time [R92]. Also, in Appendix A we describe 
an adaptation of Hopcroft's algorithm to optimally compress ST. In Figure 6, we give the minimal 
automaton for the simplicial complex of Figure 1. 
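
Since SA is acyclic and every state is accepting, two states are indistinguishable exactly when they have the same outgoing transitions into already merged states. The Python sketch below uses this observation to merge states bottom-up. It is a simple hashing-based illustration in the spirit of minimization for acyclic automata, not Hopcroft's algorithm and not the adapted algorithm of Appendix A; the names `build_simplex_tree`, `minimize` and `accepts` are hypothetical.

```python
from itertools import combinations

def build_simplex_tree(maximal):
    """Nested-dict trie of all faces; keys are the transition labels of SA."""
    root = {}
    for sigma in maximal:
        verts = sorted(sigma)
        for size in range(1, len(verts) + 1):
            for word in combinations(verts, size):
                node = root
                for label in word:
                    node = node.setdefault(label, {})
    return root

def minimize(tree):
    """Merge states with identical outgoing transitions, bottom-up.

    Returns (initial_state, transitions) where transitions maps a state id
    to a dict {label: target state id}. All states are accepting.
    """
    ids = {}
    transitions = {}

    def visit(sub):
        signature = tuple((label, visit(child)) for label, child in sorted(sub.items()))
        if signature not in ids:
            ids[signature] = len(ids)
            transitions[ids[signature]] = dict(signature)
        return ids[signature]

    return visit(tree), transitions

def accepts(initial, transitions, simplex):
    """Membership query: read the sorted labels from the initial state."""
    state = initial
    for label in sorted(simplex):
        state = transitions[state].get(label)
        if state is None:
            return False
    return True

MAXIMAL = [(1, 3, 4, 5), (2, 3, 4, 5), (1, 3, 6)]
initial, delta = minimize(build_simplex_tree(MAXIMAL))
```

On the running example the 28 states of SA reduce to 7 states in this sketch, and membership queries read the word from the initial state exactly as they would in ST.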



Figure 6: Minimal Simplex Automaton of the simplicial complex in Figure 1. 

While there are delicate differences between M(SA) and C(ST), we will see below that performing
basic operations on M(SA) is not very different from the way it is done for C(ST). 
3.2.1 Operations on the Minimal Simplex Automaton 


The set of all paths originating from the root is the same in both ST and M(SA). All operations 
which involve only traversal along ST are performed with equal (if not better) efficiency in M(SA) 
as, for every such operation on ST, we start by traversing from the root. As an example, consider 
the operation of determining if a simplex $\sigma$ is in the complex. Let us adapt the algorithm described 
in [BM14] to M(SA). Note that there is a unique path from the initial state which identifies $\sigma$ 
in M(SA). If $\sigma = v_{\ell_0} - \cdots - v_{\ell_{d_\sigma}}$ then, from the initial state, we go through $d_\sigma + 1$ states by 
following the transitions $\ell_0, \ldots, \ell_{d_\sigma}$ in that order. If at some point the requisite transition is not 
found, then we declare that the simplex is not in the complex. Hence, all static operations, 
i.e., all operations where we do not change M(SA) in any way, can be carried out in very much 
the same way in both M(SA) and ST, although it might be more efficient for M(SA) as discussed 
earlier for C(ST) in subsection 3.1. 

Addition and deletion of simplices can be trickier in M(SA) than in ST. We can always expand 
M(SA) to SA, (locally) perform the operation and recompress. If the nature of the operation is 
itself expensive (i.e., worst-case $\Omega(m)$) then the worst-case cost does not change, which is indeed 
the case for operations such as removal of cofaces, edge contraction and elementary collapses. 


3.2.2 Complexity Measure of Size for the Minimal Simplex Automaton 

Our complexity measure for SA is the number of states: minimizing SA yields M(SA), which has the minimum 
number of states. We know from the Myhill-Nerode theorem that there is a unique minimal DFA. Thus, 
if there was an automaton with fewer transitions than M(SA), then it should have more states than 
M(SA). Let A be an automaton with a minimal number of transitions and let us run Hopcroft's 
algorithm on A. Since at no point in Hopcroft's algorithm do we increase the number of transitions, 
the output has to be an automaton whose numbers of states and transitions are both minimal. 
It follows that the unique automaton output by Hopcroft's algorithm must have both a minimal 
number of states and a minimal number of transitions. This is proved more formally as Proposition 
1 in [MMR13]. 

In the rest of the paper, we denote by |SA| and |M(SA)| the number of states in SA 
and M(SA) respectively. While we consider the number of states as a complexity measure of 
the size of SA and M(SA), we will still use the number of edges as a complexity measure of the 
size of ST and C(ST) because the bounds obtained relating the number of edges in C(ST) and the 
number of nodes in C(ST) are not satisfactory. The size of M(SA) will be discussed in detail in 
section 5, after introducing a new data structure in the next section. This is done to put the impact 
of compression in better perspective. 


4 Maximal Simplex Tree 


We define the Maximal Simplex Tree MxST(K) as an induced subgraph of ST(K). All leaves in the 
Simplex Tree corresponding to maximal simplices and the nodes encountered on the path from the 
root to these leaves are kept in the Maximal Simplex Tree and the remaining nodes are removed. 
MxST(K) is constructed as follows. We start from an empty tree and then insert the words 
representing the maximal simplices of $K$. Specifically, when inserting the word $[\sigma] = [\ell_0, \ldots, \ell_j]$, 
start from the root, and follow the path containing successively all labels $\ell_0, \ldots, \ell_i$, where $[\ell_0, \ldots, \ell_i]$ 
denotes the longest prefix of $[\sigma]$ already stored in the Maximal Simplex Tree. We then append to 
the node representing $[\ell_0, \ldots, \ell_i]$ a path consisting of the nodes storing labels $\ell_{i+1}, \ldots, \ell_j$. Figure 7 
shows the MxST of the simplicial complex given in Figure 1. In MxST(K), the leaves are in 
bijection with the maximal simplices of $K$. Any path starting from the root provides the vertices 
of a simplex of $K$. However, in general, not all simplices in $K$ can be associated to a path from the 
root in MxST(K). 



Figure 7: Simplicial Complex of Figure 1 represented using Maximal Simplex Tree. 

By the above construction of MxST, we add at most $d + 1$ nodes per maximal simplex. Hence, 
MxST(K) has at most $k(d+1) + 1$ nodes and at most $k(d+1)$ edges (therefore requiring $O(kd \log n)$ 
space). We denote by |MxST| the number of edges in MxST. Since MxST is a factor of ST, 
the size of MxST is usually much smaller than the size of ST. Further, it always meets the lower 
bound of Theorem 1, making it a compact data structure. We discuss below the efficiency of MxST 
in answering queries. 
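
The construction and the size bound can be illustrated with a short Python sketch. Only the maximal simplices are inserted, so a simplex of the complex need not label a path from the root; the membership test below therefore searches for the word of the simplex as a subsequence of some root-to-leaf path. The names `build_mxst`, `count_edges` and `contains` are hypothetical, and the search is a naive illustration rather than the procedure analyzed in Appendix B.

```python
def build_mxst(maximal):
    """Nested-dict trie holding only the words of the maximal simplices."""
    root = {}
    for sigma in maximal:
        node = root
        for label in sorted(sigma):
            node = node.setdefault(label, {})
    return root

def count_edges(tree):
    return sum(1 + count_edges(sub) for sub in tree.values())

def contains(tree, simplex):
    """True iff the sorted word of simplex is a subsequence of a root-to-leaf path."""
    word = tuple(sorted(simplex))

    def search(node, i):
        if i == len(word):
            return True
        for label, child in node.items():
            # Labels increase along a path, so only smaller labels may be skipped.
            if label == word[i] and search(child, i + 1):
                return True
            if label < word[i] and search(child, i):
                return True
        return False

    return search(tree, 0)

MAXIMAL = [(1, 3, 4, 5), (2, 3, 4, 5), (1, 3, 6)]
mxst = build_mxst(MAXIMAL)
```

For the running example ($k = 3$, $d = 3$) the sketch produces 9 edges, within the bound $k(d+1) = 12$, compared with 27 edges in ST.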


4.1 Operations on the Maximal Simplex Tree 

In [BM14] some important basic operations (with appropriate motivation) have been discussed for 
ST. We will now bound the cost of these operations using MxST. Note that any node in MxST(K) 
has $O(n)$ children and we can search for a particular child of a node in time $O(\log n)$ (using red-black
trees). We summarize in Table 1 the asymptotic cost of some basic operations (details of this 
analysis are provided in Appendix B) and note that it is already better than ST for some operations. 

Moreover, we can augment the structure of MxST without paying for a lot of extra memory 
space, so that the above operations can be performed more efficiently. This is explained in section 6. 
Operation                                          Cost

Identifying maximal cofaces of a simplex $\sigma$ /    $O(kd \log n)$
Determining membership of $\sigma$
Insertion of a maximal simplex $\sigma$                $O(k d_\sigma \log n)$
Removal of a face                                  $O(kd \log n)$
Elementary Collapse                                $O(k d\, d_\sigma \log n)$
Edge Contraction                                   $O(kd(k + \log n))$


Table 1: Cost of performing basic operations on MxST. 


5 Results on Minimization of the Simplex Automaton 


In this section we will see some results, both theoretical and experimental on the minimization of 
SA. 

5.1 Bounds on the Number of States of the Minimal Simplex Automaton 

We observe below that the number of leaves in ST is large and grows linearly w.r.t. the number of 
nodes in ST. The proof follows by a simple induction argument on n. 

Lemma 1. If $K \in \mathcal{K}(n, k, d, m)$ then at least half the nodes of ST(K) are leaves. 

Proof. The proof is by induction on the number $n$ of vertices. When $n = 1$, we have two simplices 
(including the empty simplex) and one leaf. Now assume (induction hypothesis) that for all simplicial
complexes on $i$ vertices the corresponding ST has at least half of its nodes as leaves. Consider 
a ST on the vertex set $\{1, 2, \ldots, i + 1\}$ containing $m$ simplices. Consider the subtree under node 1, 
say $\mathrm{ST}_1$. $\mathrm{ST}_1$ represents a simplicial complex on the vertex set $\{2, 3, \ldots, i + 1\}$ with node 1 acting 
as the root. Suppose $\mathrm{ST}_1$ has $m_1$ nodes. Then, by the induction hypothesis, $\mathrm{ST}_1$ has at least $\frac{m_1}{2}$ 
leaves. Now consider the rest of ST, which can also be independently seen as a Simplex Tree $\mathrm{ST}_2$ 
on the vertex set $\{2, \ldots, i + 1\}$. Again, by the induction hypothesis, $\mathrm{ST}_2$ has at least $\frac{m - m_1}{2}$ leaves. 
Thus ST has at least $\frac{m}{2}$ leaves. □ 
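
Lemma 1 can be checked numerically without building the tree: the node of a simplex is a leaf of ST exactly when the simplex cannot be extended by a vertex with a larger label. The Python sketch below counts nodes and leaves this way; the function names `all_simplices` and `count_leaves` are hypothetical helpers, and the empty face is represented by the empty tuple.

```python
from itertools import combinations

def all_simplices(maximal):
    """All faces, including the empty face (), as sorted tuples."""
    faces = {()}
    for sigma in maximal:
        verts = sorted(sigma)
        for size in range(1, len(verts) + 1):
            faces.update(combinations(verts, size))
    return faces

def count_leaves(simplices, n):
    """Count simplices that no label larger than their last label can extend.

    n is the largest vertex label in use.
    """
    leaves = 0
    for s in simplices:
        last = s[-1] if s else 0
        if not any(s + (j,) in simplices for j in range(last + 1, n + 1)):
            leaves += 1
    return leaves

# Running example of Figure 1.
K = all_simplices([(1, 3, 4, 5), (2, 3, 4, 5), (1, 3, 6)])
nodes, leaves = len(K), count_leaves(K, 6)
```

For the running example this gives 28 nodes, of which 16 are leaves (the 12 simplices containing vertex 5 and the 4 simplices containing vertex 6).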

Unlike ST, M(SA) has only one leaf. The following lemma shows that M(SA) has 
at most half the number of nodes of ST plus one (this follows directly from Lemma 1). 

Lemma 2. For any $K \in \mathcal{K}(n,k,d,m)$, M(SA(K)) has at most $\frac{m}{2} + 1$ states. 

Proof. Since there are $m$ nodes in ST, SA has exactly $m$ states and at least $\frac{m}{2}$ of them are leaves 
(states without outgoing transitions). The $\frac{m}{2}$ leaves can be merged. Consequently, the number of 
states of M(SA(K)) is at most $m - \frac{m}{2} + 1 = \frac{m}{2} + 1$. □ 

Similar to M(SA), we may define M(MxSA(K)) as the minimal DFA which recognizes only 
maximal simplices as words. Then, the following inequality follows: 

Lemma 3. For any pure simplicial complex $K \in \mathcal{K}(n,k,d,m)$, $|M(SA(K))| \geq |M(MxSA(K))|$. 

Proof. Let $K$ be a pure simplicial complex. Let $s_i$ be the initial state and $S_f$ be the set of the 
states of outdegree zero. To each state we associate its depth, which is the length of the longest 
directed path from $s_i$ to this state. 

We define another automaton $B$. The states of $B$ are the states of SA(K). The state $s_i$ is 
still the initial state and the new final states are $\{s \in S_f \mid \mathrm{depth}(s) = d + 1\}$. Finally, there is a 
transition between states $u$ and $v$ labeled by $a$ if and only if this transition exists in SA(K) and if 
$\mathrm{depth}(v) = \mathrm{depth}(u) + 1$. Let us prove that $B$ recognizes exactly the maximal simplices of $K$. 

Let $w$ be a word recognized by $B$. Then $w$ is a word of length $d + 1$ which was recognized by 
SA(K). Then, it corresponds to a simplex of $K$ of dimension $d$. It is a maximal simplex of $K$. 

In the other direction, let $\sigma$ be a maximal simplex of $K$. Hence the word $\sigma$ is accepted by 
SA(K) and is of length $d + 1$. If every transition which appears during this detection appears also in 
$B$, then $\sigma$ is also accepted by $B$. Thus, let us assume for a contradiction that during the detection 
of the word $\sigma$, the simplex automaton SA(K) uses a transition from a state $u$ to a state $v$ such 
that $\mathrm{depth}(v) \neq \mathrm{depth}(u) + 1$ (let us choose the first transition where it happens). By definition of 
the depth of a state, we get $\delta_1 = \mathrm{depth}(v) > \mathrm{depth}(u) + 1 = \delta_2 + 1$. This means that there exists 
a word $w$ of length $\delta_1$ such that by reading $w$ from the initial state, we arrive at $v$. Then, $w \cdot \sigma_s$ 
(where $\sigma_s$ is the suffix of $\sigma$ of length $d - \delta_2$) is also accepted by SA(K). Consequently $K$ contains 
a simplex of dimension $d + \delta_1 - \delta_2 - 1 > d$ and we have reached a contradiction. □ 

In fact, one can prove that for a large class of complexes the equality does not hold. For 
instance, consider $\mathcal{K}' \subset \mathcal{K}(n,k,d,m)$ such that for any $K \in \mathcal{K}'$ there exist two 
maximal simplices which have different first letters (i.e., when the simplices are treated as words) 
but have the same letter at position $i$, for some $i$ that is not the last position. For this subclass 
the equality does not hold. Observe also that Lemma 3 holds only for pure simplicial complexes 
because, if all complexes were allowed, we would have complexes like the one in Example 1 where 
$|M(SA(K))| < |M(MxSA(K))|$. 

Example 1. Consider the simplicial complex on seven vertices given by the following maximal 
simplices: a 4-cell 1-2-3-6-7, two triangles 2-3-5 and 4-6-7 and an edge 4-5. 


5.2 Conditions for Compression 

We would like to analyze two possible sources of compression in ST. A first type of compression may
happen when a simplex $\sigma$ belongs to several maximal simplices and its vertices appear as the last
vertices of those maximal simplices. Then compression will factorize $\sigma$ so that it will appear only
once as a common suffix of several words. A second type of compression occurs when considering
a single maximal simplex. Here too, ST stores many different words with common suffixes and
compression will factorize these common suffixes. Intuitively, the first type captures compression
solely in MxST (i.e., because of the input and the labeling on vertices we have defined) and the
second source analyzes possible compression because of the rich structure of ST. Now, we will see
a result which guarantees compression for pure simplicial complexes regardless of the labeling of
the vertices:

Lemma 4. For any pure simplicial complex $K \in \mathcal{K}(n,k,d,m)$, we have that $|\mathcal{M}(\mathrm{SA})|$ is always
less than $|\mathrm{SA}|$ when $k < d$ and $d > 2$.

Proof. Let $K$ be a pure simplicial complex. Let $\sigma = v_1, \ldots, v_d, v_{d+1}$ be a maximal simplex. Let
us define $\nu = \{v_1, \ldots, v_{d-1}\}$ and $M = \{m \in K \mid m \text{ is maximal and } v_d \in m\}$. We also define
$P_\nu \subset \mathcal{P}(\nu)$ as the projection of $M$ onto $\nu$, which is more formally written as:

$$P_\nu = \{\sigma' \subset \nu \mid \exists m \in M,\ \sigma' = m \cap \nu\}.$$

We notice that $\nu \in P_\nu$. Since $d > k \geq |M| \geq |P_\nu|$, it follows that there exists $b \subset \nu$ such that
$|b| = |\nu| - 1$ and which is not in $P_\nu$. Let $s_\nu$ and $s_b$ be the states in SA($K$) reached by reading the
words $\nu \cdot v_d$ and $b \cdot v_d$. As the language is closed by subwords and as $b \subset \nu$, any accepting word
from the state $s_\nu$ is also an accepting word from the state $s_b$. Reciprocally, if $w$ is an accepting
word from the state $s_b$ then $b \cup v_d \cup w$ is a face of a maximal simplex $m$. The projection of $m$
on $\nu$ contains $b$ and, by definition of $b$, is strictly larger. Hence $\nu \subset m$, and so $\nu \cup v_d \cup w \in K$.
Consequently the states $s_\nu$ and $s_b$ are equivalent and can be merged. □


In fact, the above result is close to tight: in Example 4, we have a pure simplicial complex with
k = d and |ST| = |C(ST)|. We remark here that, in cases where compression is impossible, we
analyze throughout this section the compression of ST rather than the minimization of SA, because
analyzing ST provides better insight into the combinatorial structures which hinder compression.

Intuitively, it seems natural that if the given simplicial complex has a large number of maximal 
simplices then, regardless of the labeling we should be able to compress some pairs of nodes in 
MxST. However, Example 2 says otherwise. 

Example 2. Consider the simplicial complex on 2n vertices of dimension n/2 defined by the set
of maximal simplices given by:

$$\left\{ g(i) \cup \{ g(i)_r + n \} \;\middle|\; i \in \left\{1, 2, \ldots, \tbinom{n}{n/2}\right\},\ r \in \{1, 2, \ldots, n/2\} \right\}$$

where $g$ is a bijective map from $\{1, 2, \ldots, \binom{n}{n/2}\}$ to the set of simplices on n vertices of dimension
n/2 - 1 and $g(i)_r$ corresponds to picking the $r$-th vertex of $g(i)$ (in lexicographic order).

Here $k \approx 2^{n-1/2}\sqrt{n/\pi}$ and there is no compression in MxST. Also note that |C(MxST)| <
|MxST| does not imply |C(ST)| < |ST| as can be seen in Example 3.

Example 3. Consider the simplicial complex on seven vertices given by the maximal simplices: 
tetrahedron 1-2-4-6 and three triangles 2-4-5, 3-4-5 and 1-4-7. 


In both Examples 2 and 4, we saw simplicial complexes of large dimension which cannot be 
compressed, but this is due to the way the vertices were labeled. Now, we state a lemma which 
says that there is always a labeling which ensures compression.

Lemma 5. If $K_\theta \in \mathcal{K}(n,k,d,m)$ with $d > 1$ then we can find a permutation $\pi$ on $\{1, 2, \ldots, n\}$
such that $|\mathcal{M}(\mathrm{SA}(K_{\pi \circ \theta}))| < |\mathrm{SA}(K_{\pi \circ \theta})|$.

Proof. Let $m = v_{i_0} \cdots v_{i_d}$ be a maximal simplex in $K_\theta$. We construct $\pi$ by swapping $i_{d-1}$ and $i_d$
with $n - 1$ and $n$ respectively. In ST($K_{\pi \circ \theta}$), note that the node under the root with label $n - 1$ and
the node corresponding to the simplex $v_{i_0} \cdots v_{i_{d-1}}$ are identical and thus can be merged. □

We would have liked to obtain better bounds for the size of $\mathcal{M}(\mathrm{SA})$ through conditions just
based on n, k, d and m, but sadly this is a hard combinatorial problem. Also, while there is always
a good labeling, we show in section 7 that it is NP-Hard to find it.


5.3 Experiments 

We define two parameters here, $\rho_{\mathrm{ST}}$ and $\rho_{\mathrm{MxST}}$: the first is given by the ratio of |ST| and |C(ST)|,
and the second by the ratio of |MxST| and |C(MxST)|. All experiments performed below record
the extent of compression of ST. Ideally, we would have liked to record the extent of minimization
of SA since $\mathcal{M}(\mathrm{SA})$ is more compact than C(ST). Unfortunately, this has not been possible due
to the lack of available libraries able to handle very large automata. The results below for C(ST)
are nonetheless positive, substantiating our claim that compression of ST leads to a compact data
structure.

Data Set 1: The set of points was obtained by sampling a Klein bottle in $\mathbb{R}^5$ and
constructing the Rips complex with parameter r, using libraries provided by the GUDHI project
[GUDHI], for various values of r. We record in Table 2 |C(ST)| and |C(MxST)| for the
various complexes constructed.


No  n       r     d   k       |ST| = m-1   |MxST|   |C(ST)|  $\rho_{\mathrm{ST}}$  |C(MxST)|  $\rho_{\mathrm{MxST}}$
1   10,000  0.15  10  24,970  604,572      96,104   218,452  2.77     90,716     1.06
2   10,000  0.16  13  25,410  1,387,022    110,976  292,974  4.73     104,810    1.06
3   10,000  0.17  15  27,086  3,543,582    131,777  400,426  8.85     123,154    1.07
4   10,000  0.18  17  27,286  10,508,485   149,310  524,730  20.03    137,962    1.08

Table 2: Analysis of experiments on Data Set 1.

First, observe that |MxST| is considerably smaller than |ST|. This is expected, as it is likely
that k is polynomially related to n for Rips complexes. Also, while we observe insignificant
compression in MxST, $\rho_{\mathrm{ST}}$ increases rapidly as r is increased. This indicates that compression strongly
exploits the combinatorial redundancy of ST (i.e., storing each simplex explicitly through a node)
and works particularly well for the Simplex Tree.
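
As a quick arithmetic check, the following Python sketch recomputes both ratios from the node counts listed in Table 2. The names `TABLE_2` and `compression_ratios` are introduced here purely for illustration and are not part of the original experiments.

```python
# Node counts transcribed from Table 2, keyed by experiment number:
# (|ST|, |MxST|, |C(ST)|, |C(MxST)|).
TABLE_2 = {
    1: (604_572, 96_104, 218_452, 90_716),
    2: (1_387_022, 110_976, 292_974, 104_810),
    3: (3_543_582, 131_777, 400_426, 123_154),
    4: (10_508_485, 149_310, 524_730, 137_962),
}


def compression_ratios(st, mxst, c_st, c_mxst):
    """Return (rho_ST, rho_MxST), each rounded to two decimals."""
    return round(st / c_st, 2), round(mxst / c_mxst, 2)


# Recomputed ratios for every row of the table.
RATIOS = {no: compression_ratios(*row) for no, row in TABLE_2.items()}
```

The recomputed pairs agree with the two ratio columns of Table 2, which makes the contrast explicit: the ratio for ST grows from 2.77 to 20.03 while the ratio for MxST stays near 1.06.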

Data Set 2: All experiments conducted above are for Rips complexes with $d/n$ small. We now
check the extent of compression for simplicial complexes with large $d/n$. To this aim, we look at flag
complexes generated using a random graph $G_{n,p}$ on n vertices where a pair of vertices share an
edge with probability p, and record in Table 3 |C(ST)| and |C(MxST)| for the various complexes
constructed.

No  n   p     d   k    |ST| = m-1   |MxST|  |C(ST)|  $\rho_{\mathrm{ST}}$   |C(MxST)|  $\rho_{\mathrm{MxST}}$
1   25  0.8   17  77   315,369      587     467      537.3     121        4.85
2   30  0.75  18  83   4,438,558    869     627      7,079.0   134        6.49
3   35  0.7   17  181  3,841,590    1,592   779      4,931.4   245        6.50
4   40  0.6   19  204  9,471,219    1,940   896      10,570.6  276        7.03
5   50  0.5   20  306  25,784,503   2,628   1,163    22,170.7  397        6.62

Table 3: Analysis of experiments on Data Set 2.
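
The inputs of Data Set 2 can be reproduced in spirit with a few lines of Python. The sketch below is an assumption-laden illustration, not the authors' code: it samples a random graph, lists the maximal simplices of its flag complex using Bron-Kerbosch with pivoting (our choice of algorithm), and counts all non-empty simplices by enumerating cliques. The function names and the `seed` parameter are hypothetical.

```python
import random
from itertools import combinations


def random_graph(n, p, seed=0):
    """Adjacency sets of a random graph on vertices 1..n with edge probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(1, n + 1)}
    for u, v in combinations(range(1, n + 1), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def maximal_cliques(adj):
    """Yield the maximal cliques (maximal simplices of the flag complex) as sorted tuples."""
    def expand(r, p, x):
        if not p and not x:
            yield tuple(sorted(r))
            return
        # Pivot on the vertex with most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())


def count_cliques(adj):
    """Number of non-empty cliques, i.e. of simplices in the flag complex."""
    def extend(candidates):
        total = 0
        for v in candidates:
            # Only extend with larger neighbours so each clique is counted once.
            total += 1 + extend({u for u in candidates & adj[v] if u > v})
        return total
    return extend(set(adj))
```

For a sampled graph `g`, `len(list(maximal_cliques(g)))` plays the role of k and `count_cliques(g)` counts the simplices that ST stores explicitly. The clique count grows very quickly with n and p, so small parameters are advisable when experimenting.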


Here we observe staggering values for $\rho_{\mathrm{ST}}$, which increases as the simplicial complex grows larger.
This is primarily because random simplicial complexes don't behave like pathological simplicial
complexes (such as Examples 2 and 4) which hinder compression.


6 Simplex Array List 


In this section, we build a new data structure which is a hybrid of ST and MxST. The Simplex Array
List SAL($K$) is a (rooted) directed acyclic graph on at most $k\left(\frac{d(d+1)}{2} + 1\right)$ nodes with maximum
out-degree d, which can be obtained by modifying MxST or can be constructed from the maximal
simplices of $K$. We describe the construction of SAL below.


6.1 Construction 

We will first see how to obtain SAL from MxST by performing three operations which we define 
below. 

1. Unprefixing ($\mathcal{U}$): Excluding the root and the leaves, for every node $v$ in MxST with outdegree
$d_v$, duplicate it into $d_v$ nodes with outdegree 1 (one copy of $v$ for each of its children)
by starting from the parents of the leaves and recursively moving up in the tree.



Figure 8: Unprefixing the Maximal Simplex Tree of Figure 7.

2. Transitive Closure ($\mathcal{T}$): For every pair of nodes $(u, v)$ in $\mathcal{U}(\mathrm{MxST})$ ($u$ not being the root),
if there is a path from $u$ to $v$ then add an edge from $u$ to $v$ in $\mathcal{T}(\mathcal{U}(\mathrm{MxST}))$ (if it doesn't
already exist).



Figure 9: Transitive Closure of the Unprefixed Maximal Simplex Tree of Figure 8.


3. Expanding Representation ($\mathcal{R}$): For every node $v$ in $\mathcal{T}(\mathcal{U}(\mathrm{MxST}))$ with outdegree $d_v$,
duplicate it into $d_v$ nodes with outdegree 1, i.e., one copy of $v$ for each of its children, by
starting from the children of the root and recursively moving down to children of smallest
label. As a demonstration, in Figure 10, we show the result of expanding one node in the
$\mathcal{T}(\mathcal{U}(\mathrm{MxST}))$ of Figure 9.



$\mathcal{R}(\mathcal{T}(\mathcal{U}(\mathrm{MxST})))$ is the Simplex Array List. From the construction of SAL, it is clear that
each node in SAL uniquely represents an edge in the simplicial complex. Figure 11 shows the SAL
representation of the simplicial complex given in Figure 1.

Figure 11: Simplicial Complex of Figure 1 represented using $\mathcal{R}(\mathcal{T}(\mathcal{U}(\mathrm{MxST})))$.

We will now see an equivalent construction of SAL from its maximal simplices, and it is this
construction we will use to perform operations. For a given maximal simplex $\sigma = v_{\ell_0} \cdots v_{\ell_j}$, associate
a unique key between 1 and $k$ generated using a hash function $\mathcal{H}$ and then introduce $\frac{j(j+1)}{2} + 1$ new
nodes in SAL. We build a set of $\frac{j(j+1)}{2} + 1$ labels and assign uniquely a label to each node. The
set of labels is defined as the union of the following two sets (cf. Figure 12 for an example):

$$S_1 = \{(\ell_i, \ell_{i'}, \mathcal{H}(\sigma)) \mid i \in \{0, 1, \ldots, j-1\},\ i' \in \{i+1, \ldots, j\}\}$$

$$S_2 = \{(\ell_j, \varphi, \mathcal{H}(\sigma))\}$$

where $\varphi$ denotes an empty label. We introduce an edge from the node with label $(\ell_p, \ell_{p'}, \mathcal{H}(\sigma))$ to the node
with label $(\ell_q, \ell_{q'}, \mathcal{H}(\sigma))$ if and only if $p' = q$. Additionally, we introduce an edge from every node
with label $(\ell_p, \ell_j, \mathcal{H}(\sigma))$ in $S_1$ to the node with label $(\ell_j, \varphi, \mathcal{H}(\sigma))$ in $S_2$. Thus, in SAL we represent
a maximal $j$-simplex using a connected component containing $|S_1| + |S_2| = \frac{j(j+1)}{2} + 1$ nodes and
$\frac{j(j^2+5)}{6}$ directed edges. To perform basic operations efficiently, we embed SAL on the number line
such that for every $i \in \{1, 2, \ldots, n\}$, we have an array $A_i$ of nodes which has labels of the form
$(i, i', z)$ for some $z \in \{1, \ldots, k\}$ and $i' \in \{i+1, \ldots, n, \varphi\}$. Sort each $A_i$ based on $i'$ and in case of
ties, sort them based on $z$.
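
The label-based construction can be sketched in Python as follows. This is only an illustration under stated assumptions: the hash function is replaced by the position of the simplex in the input list, the empty label is encoded as `None` and sorted after every vertex, and the names `sal_component` and `build_sal` are hypothetical.

```python
from collections import defaultdict

PHI = None  # stands for the empty label


def sal_component(simplex, key):
    """Nodes and directed edges of the component of one maximal simplex."""
    ell = sorted(simplex)
    j = len(ell) - 1
    # S1: one node per ordered pair of vertices; S2: a single sink node.
    s1 = [(ell[i], ell[ip], key) for i in range(j) for ip in range(i + 1, j + 1)]
    sink = (ell[j], PHI, key)
    edges = []
    for (a, b, _) in s1:
        if b == ell[j]:
            edges.append(((a, b, key), sink))
        else:
            # Edge to every node of the component whose first label is b.
            edges.extend(((a, b, key), (b, c, key)) for c in ell if c > b)
    return s1 + [sink], edges


def build_sal(maximal_simplices):
    """Arrays A_i (sorted by second label, then key) and the list of all edges."""
    arrays = defaultdict(list)
    edges = []
    # Assumption: the key of a simplex is its 1-based position in the input.
    for key, simplex in enumerate(maximal_simplices, start=1):
        nodes, component_edges = sal_component(simplex, key)
        for node in nodes:
            arrays[node[0]].append(node)
        edges.extend(component_edges)
    for array in arrays.values():
        array.sort(key=lambda t: (float("inf") if t[1] is PHI else t[1], t[2]))
    return dict(arrays), edges
```

Calling `sal_component` on a full simplex with j + 1 vertices gives a direct way to check the node and edge counts of a component for small j.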

The resultant graph obtained after removing the root in $\mathcal{R}(\mathcal{T}(\mathcal{U}(\mathrm{MxST})))$ is the same as the one
described in the previous paragraph. Labels (as described above) for the nodes in $\mathcal{R}(\mathcal{T}(\mathcal{U}(\mathrm{MxST})))$
can be easily given by just looking at the vertex represented by the node, and its children.

We remark here that we use a hash function $\mathcal{H}$ to generate keys for simplices because it is an
efficient way to reuse keys (in case of multiple insertions and removals).



Figure 12: Simplex Array List for complex in Figure 1 embedded on the number line.

6.2 Some Observations about the Simplex Array List


SAL($K$) has at most $k\left(\frac{d(d+1)}{2} + 1\right)$ nodes. Also, for each maximal simplex of dimension $d_\sigma$, the
outdegree of any node in the connected component corresponding to the maximal simplex is at
most $d_\sigma$. Therefore, the total number of edges in SAL($K$) is at most $kd\left(\frac{d(d+1)}{2} + 1\right)$. Further, in
each node we store the labels of two vertices (which requires $\log n$ bits) and a hashed value (which
requires $\log k$ bits). Hence, the space required to store SAL($K$) is $O(kd^2(d + \log n + \log k))$. Also,
unless otherwise stated, |SAL| refers to the number of edges in SAL. Since SAL is constructed
from $\mathcal{U}(\mathrm{MxST})$, we have the following lemma:

Lemma 6. The number of nodes and edges in SAL are both invariant over the labeling of the 
vertices in the simplicial complex. 

Intuitively, SAL represents $K$ by storing all the edges of $K$ explicitly as nodes in SAL($K$),
and the edges in SAL($K$) are used to capture the incidence relations between simplices. More
precisely, a path of length $j$ in SAL($K$) corresponds to a unique $j$-simplex in $K$. We now see that,
differently from MxST, the simplices of $K$ are all associated with paths in SAL($K$). We say a path
$p$ is associated to a simplex $\sigma$ if the sequence of numbers obtained by looking at the corresponding
nodes which are embedded on the number line along $p$ are exactly the labels of the vertices of $\sigma$ in
lexicographic order.

Lemma 7. Any path in SAL($K$) is associated to a simplex of $K$ and any simplex of $K$ is associated
to at least one such path.

Proof. Any path in SAL($K$) belongs to a connected component of SAL($K$). In particular,
there exists a maximal simplex $m$ such that all nodes in this component are of the form $(a, b, \mathcal{H}(m))$
for some $a, b$ such that $v_a$ is a vertex of $m$. Consequently, all vertices read along this path belong
to $m$; the corresponding simplex is therefore a face of $m$, so it belongs to $K$.

On the other hand, if $\sigma = v_{\ell_1} \cdots v_{\ell_r}$ is a simplex of $K$, it belongs to some maximal simplex
$m = v_{t_1} \cdots v_{t_s}$ where we have $\{\ell_1, \ldots, \ell_r\} \subseteq \{t_1, \ldots, t_s\}$. In SAL($K$), we look at the connected
component associated with $m$. While we can associate any node with label $(a, b, z)$ uniquely to the
edge $a$-$b$, we can also uniquely identify it with vertex $a$. We associate $\sigma$ to the path:

$$(\ell_1, \ell_2, \mathcal{H}(m)) - (\ell_2, \ell_3, \mathcal{H}(m)) - \cdots - (\ell_r, x, \mathcal{H}(m)),$$

for some $x \in \{\ell_r + 1, \ldots, n, \varphi\}$. The existence of the path is confirmed because of the way we
introduced edges in the connected component. □

Observe that several paths can provide the same simplex since a simplex may appear in several
maximal simplices. Hence, the vertices of a given simplex cannot be accessed in a deterministic
way. The previous lemma together with this observation implies that SAL is a non-deterministic
finite automaton (NFA). NFAs are a natural generalization of DFAs. The size of an NFA can be smaller
than that of a DFA recognizing the same language, but the operations on an NFA take in general more
time. We demonstrate the above fact using Example 4.

Example 4. Let $K \in \mathcal{K}(2k+1, k, k, m)$ be defined on the vertices $\{1, \ldots, 2k+1\}$ and let the set of
maximal simplices be given by $\{(\{1, \ldots, k+1\} \setminus \{i\}) \cup \{k+1+i\} \mid 1 \leq i \leq k\}$.

Thus SAL($K$) has $k\left(\frac{k(k+1)}{2} + 1\right)$ nodes while $\mathcal{M}(\mathrm{SA}(K))$ has at least $2^k$ states (all states reached
after reading the words $s \subset \{1, \ldots, k\}$ are pairwise distinct). Moreover, this motivates the need for
considering SAL over $\mathcal{M}(\mathrm{SA})$, as the gap in their sizes can be exponential.

Building SAL($K$) can be seen as partially compressing the Simplex Tree ST($\sigma$) associated
to each maximal simplex $\sigma$ (where $\sigma$ and its subfaces are seen as a subcomplex). Compressing
ST($\sigma$) will lead to a subtree which is exactly the same as the transitive closure of MxST($\sigma$).
Therefore, collecting C(ST($\sigma$)) for all maximal simplices $\sigma$ and merging the roots is the same as
$\mathcal{T}(\mathcal{U}(\mathrm{MxST}(K)))$. Now applying $\mathcal{R}$ on $\mathcal{T}(\mathcal{U}(\mathrm{MxST}(K)))$ can be seen as an act of uncompression.
We apply $\mathcal{R}$ once to ensure that for every node, all its children represent the same vertex and thus
belong to the same $A_i$. If $\mathcal{R}$ is applied multiple times then it is equivalent to duplicating nodes
(seen as an act of uncompression) to get all children of a node closer together inside $A_i$. Next, we
discuss how to perform operations in SAL at least as efficiently as in ST.


6.3 Operations on the Simplex Array List 

Let us now analyze the cost of performing basic operations on SAL (the motivation behind these
operations is well described in [BM14]). Denote by $\Gamma_j(\sigma, \tau)$ the number of maximal simplices that
contain a $j$-simplex $\tau$ which is in $\sigma$. Define $\Gamma_j(\sigma) = \max_{\tau \in \sigma} \Gamma_j(\sigma, \tau)$ and $\Gamma_j = \max_{\sigma \in K} \Gamma_j(\sigma)$. It is easy
to see that $k \geq \Gamma_0 \geq \Gamma_1 \geq \cdots \geq \Gamma_d = 1$. In the case of SAL, we are interested in the value of $\Gamma_1$,
which we use to estimate the worst-case cost of basic operations in SAL.
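
To make these parameters concrete, here is a small Python sketch that computes them by brute force from the list of maximal simplices. The function name `gamma` is ours, and the approach (enumerating every face of the given size of every maximal simplex) is only sensible for small complexes. The complex used is the one from Example 3.

```python
from itertools import combinations


def gamma(maximal_simplices, j):
    """Largest number of maximal simplices sharing a common j-face (0 if none exists)."""
    count = {}
    for simplex in maximal_simplices:
        # A j-face has j + 1 vertices; count how many maximal simplices contain it.
        for face in combinations(sorted(simplex), j + 1):
            count[face] = count.get(face, 0) + 1
    return max(count.values(), default=0)


# Maximal simplices of Example 3: a tetrahedron and three triangles.
EXAMPLE_3 = [(1, 2, 4, 6), (2, 4, 5), (3, 4, 5), (1, 4, 7)]
print([gamma(EXAMPLE_3, j) for j in range(4)])  # [4, 2, 1, 1]
```

Here k = 4, vertex 4 lies in all four maximal simplices, and no edge lies in more than two of them, so the chain of inequalities above reads 4 >= 4 >= 2 >= 1 >= 1.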

Membership of Simplex. To determine membership of $\sigma = v_{\ell_0} \cdots v_{\ell_{d_\sigma}}$ in $K$, first determine
the contiguous subarrays of $A_{\ell_0}, \ldots, A_{\ell_{d_\sigma}}$, say $B_{\ell_0}, \ldots, B_{\ell_{d_\sigma}}$, such that every $B_{\ell_i}$ contains all nodes
with labels of the form $(\ell_i, \ell_{i+1}, z)$, for some $z$ (the $B_{\ell_i}$ indeed form contiguous subarrays because
of the way elements in $A_{\ell_i}$ were sorted). We emphasize here that we determine each $B_{\ell_i}$ only by
its starting and ending location in $A_{\ell_i}$ and do not explicitly read the contents of each element in
$B_{\ell_i}$. Thus, if $P$ is a projection function such that $P((\ell_i, \ell_{i+1}, z)) = z$ then we see each $P(B_{\ell_i})$
as a subset of $\{1, \ldots, k\}$ because the only part of the label that distinguishes two elements in $B_{\ell_i}$
is the hash value of the maximal simplex. Now we have $\sigma \in K$ if and only if

$$\bigcap_{0 \leq i \leq d_\sigma} P(B_{\ell_i}) \neq \emptyset.$$

This is because if $\sigma \in K$ then, from Lemma 7, there should exist a path corresponding to this
simplex, which would imply $\bigcap_i P(B_{\ell_i}) \neq \emptyset$, and if $\tau \in \bigcap_i P(B_{\ell_i})$ then $\sigma$ is a face of $\tau$. Computing the
intersection can be done in $O(\gamma d_\sigma \log \zeta)$ time, where $\gamma = \min_i |B_{\ell_i}|$ and $\zeta = \max_i |A_{\ell_i}|$. Computing the
subarrays can be done in $O(d_\sigma \log \zeta)$ time. Thus, the total running time is $O(d_\sigma(\gamma \log \zeta + \log \zeta)) =
O(d_\sigma \Gamma_1 \log(\Gamma_0 d)) = O(d_\sigma \Gamma_1 \log(kd))$.
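
A minimal Python sketch of this query is given below, under several assumptions that are ours rather than the paper's: keys are the 1-based positions of the maximal simplices in the input list, each array stores only the pair (second label, key) since the first label is the array index, the empty label is encoded as infinity so that it sorts last, and for the last vertex of the queried simplex the whole array is used. For brevity the sketch intersects Python sets instead of implementing the binary-search-based intersection analyzed above. The complex is the one from Example 3.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

INF = float("inf")  # encodes the empty label; sorts after every vertex


def build_arrays(maximal_simplices):
    """A_a holds (second label, key) for every node whose first label is a."""
    arrays = defaultdict(list)
    for key, simplex in enumerate(maximal_simplices, start=1):
        ell = sorted(simplex)
        for i, a in enumerate(ell[:-1]):
            for b in ell[i + 1:]:
                arrays[a].append((b, key))
        arrays[ell[-1]].append((INF, key))
    for array in arrays.values():
        array.sort()
    return arrays


def containing_keys(arrays, simplex):
    """Keys of the maximal simplices containing a non-empty simplex (empty set if none)."""
    ell = sorted(simplex)
    result = None
    for i, a in enumerate(ell):
        array = arrays.get(a, [])
        if i + 1 < len(ell):
            # Contiguous block of nodes (a, next vertex, z), located by binary search.
            nxt = ell[i + 1]
            lo = bisect_left(array, (nxt, 0))
            hi = bisect_right(array, (nxt, INF))
            keys = {z for _, z in array[lo:hi]}
        else:
            keys = {z for _, z in array}
        result = keys if result is None else result & keys
        if not result:
            return set()
    return result


ARRAYS = build_arrays([(1, 2, 4, 6), (2, 4, 5), (3, 4, 5), (1, 4, 7)])
```

With these arrays, `containing_keys(ARRAYS, (2, 4, 5))` returns `{2}`, while `containing_keys(ARRAYS, (2, 5, 6))` returns the empty set, so 2-5-6 is not in the complex.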

For example, consider the SAL of Figure 12, and suppose we have to check the membership of $\sigma =$
2-3-5 in the complex of Figure 1. Then, we have $B_2 = \{(2,3,2)\}$, $B_3 = \{(3,5,1), (3,5,2)\}$,
and $B_5 = \{(5,\varphi,1), (5,\varphi,2)\}$. We see each $P(B_i)$ as a subset of $\{1,2,3\}$ as follows: $P(B_2) =
\{2\}$, $P(B_3) = \{1,2\}$, and $P(B_5) = \{1,2\}$. Clearly $\bigcap_i P(B_i) = \{2\} \neq \emptyset$, and $\sigma$ is indeed a face of the
second maximal simplex 2-3-4-5.

Insertion. Suppose we want to insert a maximal simplex $\sigma$; then building a connected component
takes time $O(d_\sigma^3)$. Updating the arrays $A_i$ takes time $O(d_\sigma^2 \log \zeta)$. Next, we have to check if there
exist maximal simplices in $K$ which are now faces of $\sigma$, and remove them. We consider every edge
$\sigma_\Delta$ in $\sigma$ and compute $Z_{\sigma_\Delta}$, the set of all maximal simplices which contain $\sigma_\Delta$ (which can be done in
time $O(d_\sigma^2 \Gamma_1 \log(\Gamma_0 d))$). Then, we compute $\bigcup_{\sigma_\Delta \in \sigma} Z_{\sigma_\Delta}$, whose size is at most $d_\sigma^2 \Gamma_1$, and check if any of
these maximal simplices are faces of $\sigma$ (can be done in $O(d_\sigma^3 \Gamma_1)$ time). To remove all such faces of
$\sigma$ which were previously maximal takes time at most $O(d_\sigma^3 \Gamma_1)$. Therefore, total time for insertion
is $O(d_\sigma^2 \Gamma_1 (d_\sigma + \log(\Gamma_0 d))) = O(d_\sigma^2 \Gamma_1 (d_\sigma + \log(kd)))$.

Removal. To remove a face $\sigma$, obtain the maximal simplices which contain it (can be done
through a membership query in $O(d_\sigma \Gamma_1 \log(\Gamma_0 d))$ time). There are at most $\Gamma_{d_\sigma}$ maximal simplices
containing $\sigma$. Next, remove the above maximal simplices and then insert the facets of the above
maximal simplices which do not contain $\sigma$. More precisely, for each of the above obtained maximal
simplices make $d_\sigma$ copies of the corresponding connected component, and in the $i$-th copy delete all
nodes with label $(\sigma_i, x, y)$ for some $x, y$, where $\sigma_i$ denotes the label of the $i$-th vertex of $\sigma$. Note
that one has to first check if the facets of the maximal simplices not containing $\sigma$ are indeed maximal
before inserting them. Thus, the total running time is $O(\Gamma_{d_\sigma}(d_\sigma(d^3 \log(\Gamma_0 d) + d\,\Gamma_1 \log(\Gamma_0 d)))) =
O(\Gamma_{d_\sigma} d_\sigma d\,(d^2 + \Gamma_1) \log(kd))$.

Elementary Collapse. A simplex $\tau$ is collapsible through one of its faces $\sigma$ if $\tau$ is the only coface
of $\sigma$. Such a pair $(\sigma, \tau)$ is called a free pair, and removing both faces of a free pair is an elementary
collapse. Given a pair of simplices $(\sigma, \tau)$, checking whether it is a free pair is done by obtaining the list
of all maximal simplices which contain $\sigma$ through the membership query (costs $O(d_\sigma \Gamma_1 \log(\Gamma_0 d))$
time) and then checking if $\tau$ is the only member in that list. If yes, remove $\tau$ and insert the facets
(except $\sigma$) which are now maximal after removing $\tau$. This takes time $O(d_\sigma^4 + d_\sigma^2 \Gamma_1 \log(\Gamma_0 d))$. Thus,
the total running time is $O(d_\sigma^2(d_\sigma^2 + \Gamma_1 \log(\Gamma_0 d))) = O(d_\sigma^2(d_\sigma^2 + \Gamma_1 \log(kd)))$.

Edge Contraction. Here we cannot do better than entirely rebuilding the parts of SAL corresponding
to maximal simplices which contain the vertices in the edge to be contracted, and therefore
the cost of the operation is $O(\Gamma_0(d(\Gamma_1 \log(kd) + d^2)))$, as the number of maximal simplices a vertex
may be part of is at most $\Gamma_0$ (by definition) and checking if a simplex remains maximal after the
edge contraction can be done through the membership query.

We summarize in Table 4 the asymptotic cost of basic operations discussed above and compare 
it with ST and MxST, through which the efficiency of SAL is established. 



                                    ST                                 MxST                       SAL
Storage*                            $O(k 2^d \log n)$†                 $O(kd \log n)$             $O(kd^2(d + \log n + \log k))$
Membership of a simplex $\sigma$    $O(d_\sigma \log n)$               $O(kd \log n)$             $O(d_\sigma \Gamma_1 \log(kd))$
Insertion of a simplex $\sigma$     $O(2^{d_\sigma} d_\sigma \log n)$  $O(kd \log n)$             $O(d_\sigma^2 \Gamma_1 (d_\sigma + \log(kd)))$
Removal of a face                   $O(m \log n)$                      $O(kd \log n)$             $O(\Gamma_{d_\sigma} d\, d_\sigma (d^2 + \Gamma_1) \log(kd))$
Elementary Collapse                 $O(2^{d_\sigma} \log n)$           $O(kd\, d_\sigma \log n)$  $O(d_\sigma^2(d_\sigma^2 + \Gamma_1 \log(kd)))$
Edge Contraction                    $O(md)$                            $O(kd(k + \log n))$        $O(\Gamma_0(d(\Gamma_1 \log(kd) + d^2)))$

Table 4: Cost of performing basic operations on SAL in comparison with ST and MxST.

Performance of SAL. Plainly, if the number of maximal simplices is small (i.e., can be considered
as a constant), SAL and MxST are very efficient data structures and this is indeed the case for a
large class of complexes encountered in practice as discussed in section 2.

Remarkably, even if $k$ is not small but $d$ is small then SAL is a compact data structure as
given by the lower bound in Theorem 1. This is because $O(kd^2(d + \log n + \log k))$ bits are sufficient
to represent SAL and the lower bound is met when $d$ is fixed (as it translates to needing $O(k \log n)$
bits to represent SAL). Also, it is worth noting here that $\Gamma_0$ is usually a small fraction of $k$ and
since $\Gamma_1$ is at most $\Gamma_0$, the above operations are performed considerably faster than in MxST where
almost always the only way to perform operations is to traverse the entire tree. Indeed SAL was
intended to be efficient in this regard, as even if $k$ is not small the construction of SAL replaces
the dependence on $k$ by a dependence on a more local parameter $\Gamma_1$ that reflects some "local
complexity" of the simplicial complex. As a simple demonstration, we estimated $\Gamma_0$, $\Gamma_1$, $\Gamma_2$, and $\Gamma_3$
for the simplicial complexes of Data Set 1 (see section 5.3). These values are recorded in Table 5.


No  n       r     d   k       m           $\Gamma_0$  $\Gamma_1$  $\Gamma_2$  $\Gamma_3$  |SAL|
1   10,000  0.15  10  24,970  604,573     62          53          47          37          424,440
2   10,000  0.16  13  25,410  1,387,023   71          61          55          48          623,238
3   10,000  0.17  15  27,086  3,543,583   90          67          61          51          968,766
4   10,000  0.18  17  27,286  10,508,486  115         91          68          54          1,412,310

Table 5: Values of $\Gamma_0$, $\Gamma_1$, $\Gamma_2$, and $\Gamma_3$ for the simplicial complexes generated from Data Set 1.

It is interesting to note that the size of SAL is larger than the size of C(ST) but much smaller
than the size of ST. This is expected, as SAL promises to perform most basic operations more
efficiently than ST while compromising slightly on size. Further, our intuition, as described previously,
was that $\Gamma_0$ should be much smaller than $k$, which is supported by the above results. Also, we
note that for larger simplicial complexes such as complexes No 3 and 4, there is a noticeable gap
between $\Gamma_0$ and $\Gamma_1$. Since the complexity of basic operations using SAL is parametrized by $\Gamma_1$ (and
not $\Gamma_0$), the above results support our claim that SAL is an efficient data structure.

Local Sensitivity of Simplex Array List. It is worth noting that while the cost of basic
operations is bounded using $\Gamma_1$, we could use local parameters such as $\gamma$ and $Z_{\sigma_\Delta}$ (see previous
paragraphs on Membership of Simplex and Insertion for definition) to get a better estimate on the
cost of these operations. $\gamma$ captures local information about a simplex $\sigma$ sharing an edge with other
maximal simplices of the complex. More precisely, it is the minimum, over all edges of $\sigma$, of the
largest number of maximal simplices that contain the edge. If $\sigma$ has an edge which is contained
in a few maximal simplices then $\gamma$ is small. $Z_{\sigma_\Delta}$ captures another local property of a simplex $\sigma$:
the set of all maximal simplices that contain the edge $\sigma_\Delta$. Therefore, SAL is sensitive to the local
structure of the complex.

*We would like to recapitulate here the lower bound from Theorem 1 of $\Omega(kd \log n)$.

†The space needed to represent ST is $O(m \log n)$, which is written as $O(k 2^d \log n)$ to help in comparison.

6.4 A Sequence of Representations for Simplicial Complexes

We can use the operation $\mathcal{R}$ to generate a sequence of data structures, each more powerful than the
previous ones (but also bulkier). More formally, consider the sequence of data structures $(\Lambda_i)$, where
$\Lambda_{-1} = \mathrm{MxST}$, $\Lambda_0 = \mathcal{U}(\mathrm{MxST})$ and $\Lambda_i = \mathcal{R}^i(\mathcal{T}(\mathcal{U}(\mathrm{MxST})))$, for all $i \in \mathbb{N}$. Note that $\Lambda_1 = \mathrm{SAL}$.
Further, for all $i \in \mathbb{N}$, we will refer to the data structure $\Lambda_i$ by the name $i$-SAL (we will continue
to refer to 1-SAL as SAL).

Further, we see that in the $i$-th element of the sequence, every node which is not a leaf (sink) in
the data structure corresponds to a unique $i$-simplex in the simplicial complex. Also, for all $i$-SAL,
$i \in \mathbb{N}$, we have that it is an NFA recognizing all the simplices in the complex. As we move along the
sequence, the size of the data structure blows up by a factor of $d$ at each step. But in return, we
gain efficiency in searching for simplices, as the membership query depends on $\Gamma_i$ which decreases
as $i$ increases.

In section 6.4.1, we will see how to construct 0-SAL (i.e., $\mathcal{U}(\mathrm{MxST})$) from the maximal simplices
of a simplicial complex and in section 6.4.2, we will see how to construct $\mathcal{R}(\mathcal{R}(\mathcal{T}(\mathcal{U}(\mathrm{MxST})))) =$
2-SAL from the maximal simplices of a simplicial complex, and this would help to demonstrate the
construction of data structures which appear later in the sequence $(\Lambda_i)$.


6.4.1 Unprefixed Maximal Simplex Tree 

The Unprefixed Maximal Simplex Tree $\mathcal{U}(\mathrm{MxST})$, or 0-SAL, is a directed acyclic graph which can
be obtained by modifying MxST or can be constructed from the maximal simplices of $K$. We
will describe the latter here. We initially have $n$ empty arrays $A_1, \ldots, A_n$ and for every maximal
simplex $\sigma = v_{\ell_0} \cdots v_{\ell_j}$, associate a unique key between 1 and $k$ generated using a hash function
$\mathcal{H}$ and insert $\mathcal{H}(\sigma)$ in the arrays $A_{\ell_0}, \ldots, A_{\ell_j}$. Thus determining membership of a simplex again
reduces to computing a set intersection, and insertion and removal of simplices are performed as in
MxST, but here we are equipped with a faster search operation (i.e., a more efficient membership
query). We summarize in Table 6 the asymptotic cost of basic operations for 0-SAL and compare
it with MxST.


Operation                           Cost for MxST              Cost for 0-SAL
Membership of a simplex $\sigma$    $O(kd \log n)$             $O(d_\sigma \Gamma_0 \log k)$
Insertion of a simplex $\sigma$     $O(kd \log n)$             $O(d_\sigma \Gamma_0 (d_\sigma + \log k))$
Removal of a face                   $O(kd \log n)$             $O(d_\sigma d\, \Gamma_0 \Gamma_{d_\sigma} \log k)$
Elementary Collapse                 $O(kd\, d_\sigma \log n)$  $O(d_\sigma^2 \Gamma_0 \log k)$
Edge Contraction                    $O(kd(k + \log n))$        $O(\Gamma_0^2 d \log k)$

Table 6: Cost of performing basic operations on 0-SAL which is compared with cost of such
operations on MxST.
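
The 0-SAL membership query can be sketched in a few lines of Python. As before, this is an illustration under our own assumptions: the key of a maximal simplex is its 1-based position in the input, and the intersection is computed by scanning the shortest array and binary searching the others. The names `build_zero_sal`, `in_sorted` and `maximal_cofaces` are hypothetical.

```python
from bisect import bisect_left


def build_zero_sal(maximal_simplices):
    """Map each vertex to the sorted list of keys of maximal simplices containing it."""
    arrays = {}
    for key, simplex in enumerate(maximal_simplices, start=1):
        for v in simplex:
            # Keys arrive in increasing order, so every list stays sorted.
            arrays.setdefault(v, []).append(key)
    return arrays


def in_sorted(array, x):
    """Binary search for x in a sorted list."""
    i = bisect_left(array, x)
    return i < len(array) and array[i] == x


def maximal_cofaces(arrays, simplex):
    """Keys of maximal simplices containing every vertex of the given simplex."""
    lists = sorted((arrays.get(v, []) for v in simplex), key=len)
    if not lists:
        return []
    # Scan the shortest list and look each key up in the remaining lists.
    return [z for z in lists[0] if all(in_sorted(a, z) for a in lists[1:])]


ZERO_SAL = build_zero_sal([(1, 2, 4, 6), (2, 4, 5), (3, 4, 5), (1, 4, 7)])
```

On the complex of Example 3, `maximal_cofaces(ZERO_SAL, (4, 5))` returns `[2, 3]` and `maximal_cofaces(ZERO_SAL, (5, 6))` returns `[]`. The shortest list has length at most $\Gamma_0$ and each lookup is a binary search in a list of at most $k$ keys, which is the source of the $d_\sigma \Gamma_0 \log k$ term in Table 6.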

1-SAL has two advantages over 0-SAL. First, 1-SAL stores all simplices through its paths
(Lemma 7) and this may help in storing filtrations. In addition, it is likely that $\Gamma_1$ is significantly
smaller than $\Gamma_0$.

6.4.2 2-Simplex Array List


The 2-Simplex Array List 2-SAL($K$) is a directed acyclic graph which can be obtained by modifying
MxST (i.e., 2-SAL $= \mathcal{R}(\mathcal{R}(\mathcal{T}(\mathcal{U}(\mathrm{MxST}))))$) or can be constructed from the maximal simplices of
$K$. We will describe the latter here. For a given maximal simplex $\sigma = v_{\ell_0} \cdots v_{\ell_j}$, associate a unique
key between 1 and $k$ generated using a hash function $\mathcal{H}$ and then introduce $\binom{j+1}{3} + j + 1$ new nodes
in 2-SAL. We build a set of $\binom{j+1}{3} + j + 1$ labels and assign uniquely a label to each node. The set of
labels is defined as the union of the following three sets (cf. Figure 13 for an example):

$$S_1 = \{(\ell_{i_1}, (\ell_{i_2}, \ell_{i_3}), \mathcal{H}(\sigma)) \mid i_1 \in \{0, 1, \ldots, j-2\},\ i_2 \in \{i_1+1, \ldots, j-1\},\ i_3 \in \{i_2+1, \ldots, j\}\}$$

$$S_2 = \{(\ell_i, (\ell_j, \varphi), \mathcal{H}(\sigma)) \mid i \in \{0, 1, \ldots, j-1\}\}$$

$$S_3 = \{(\ell_j, (\varphi, \varphi), \mathcal{H}(\sigma))\}$$

where $\varphi$ denotes an empty label. Now introduce an edge between two nodes with labels
$(\ell_{i_1}, (\ell_{i_2}, \ell_{i_3}), \mathcal{H}(\sigma))$ and $(\ell_{i_4}, (\ell_{i_5}, \ell_{i_6}), \mathcal{H}(\sigma))$ if and only if $i_2 = i_4$ and $i_3 = i_5$. Next, introduce
an edge between two nodes with labels $(\ell_{i_1}, (\ell_{i_2}, \ell_{i_3}), \mathcal{H}(\sigma))$ and $(\ell_{i_4}, (\ell_j, \varphi), \mathcal{H}(\sigma))$ if and only if
$i_2 = i_4$ and $i_3 = j$. Finally, introduce an edge from every node with label in $S_2$ to the node with
label in $S_3$. Thus, in 2-SAL we represent a maximal $j$-simplex using a connected component
containing $|S_1| + |S_2| + |S_3| = \binom{j+1}{3} + j + 1$ nodes. To perform basic operations efficiently, embed 2-SAL
on the number line such that for every $i \in \{1, 2, \ldots, n\}$ on the number line we have an array $A_i$ of
nodes which has labels of the form $(i, (a, b), z)$ for some $z \in \{1, \ldots, k\}$ and $a, b \in \{i+1, \ldots, n, \varphi\}$.
Sort each $A_i$ based on $a$, in case of ties sort them based on $b$, and in case of further ties sort
them based on $z$.



Figure 13: 2-Simplex Array List for complex in Figure 1 embedded on the number line.

2-SAL($K$) has at most $k\left(\binom{d+1}{3} + d + 1\right)$ nodes and, for each maximal simplex of dimension $d_\sigma$,
the outdegree of any node in the connected component is at most $d_\sigma$. Therefore, the total number
of edges in 2-SAL($K$) is at most $kd\left(\binom{d+1}{3} + d + 1\right)$. Hence, the space required to store 2-SAL($K$) is
$O(kd^3(d + \log n + \log k))$.

We summarize in Table 7 the asymptotic cost of basic operations for 2-SAL and compare it
with SAL. All the basic operations are performed similarly to the way we did in SAL, except that here
we have more structure and thus searching becomes more efficient ($\Gamma_1$ is replaced by the smaller
$\Gamma_2$), but we pay extra in size because we have to maintain the additional structure.


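Operation                                   Cost for SAL                                                   Cost for 2-SAL
Membership of a simplex $\sigma$            $O(d_\sigma \Gamma_1 \log(kd))$                                $O(d_\sigma \Gamma_2 \log(kd^2))$
Insertion of a maximal simplex $\sigma$     $O(d_\sigma^2 \Gamma_1 (d_\sigma + \log(kd)))$                 $O(d_\sigma^3 \Gamma_2 (d_\sigma + \log(kd^2)))$
Removal of a face                           $O(\Gamma_{d_\sigma} d\, d_\sigma (d^2 + \Gamma_1) \log(kd))$  $O(\Gamma_{d_\sigma} d\, d_\sigma (d^3 + \Gamma_2) \log(kd^2))$
Elementary Collapse                         $O(d_\sigma^2(d_\sigma^2 + \Gamma_1 \log(kd)))$                $O(d_\sigma^2(d_\sigma^3 + \Gamma_2 \log(kd^2)))$
Edge Contraction                            $O(\Gamma_0(d(\Gamma_1 \log(kd) + d^2)))$                      $O(\Gamma_0(d(\Gamma_2 \log(kd^2) + d^3)))$

Table 7: Cost of performing basic operations on 2-SAL which is compared with cost of such
operations on SAL.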


6.4.3 The Bottom Line 

A natural question to resolve is which element in $(\Lambda_i)$ one should pick for representing a simplicial
complex. This indeed depends on the nature of the data and the type of complex. For instance,
consider the case of a Rips complex whose vertex set is a good sample of a smooth manifold of low
intrinsic dimension. Then, we would settle for either $\Lambda_0$ or $\Lambda_1$ as we expect $\Gamma_0$ itself to be quite
low. However, if the simplicial complex is of high dimension and if there are central simplices on
which most maximal simplices are built, then it might be better to look at elements higher up in
the sequence, as $\Gamma_i$ might be quite high in the beginning and can collapse quickly after some point.


7 Labeling Dependency 


In this section, we discuss how the labeling of the vertices affects the size of the data structures
discussed in this paper. In particular, the sizes of both ST and SAL are invariant over the labeling
of vertices in a simplicial complex (see Lemma 6). However, this is not the case with MxST
and $\mathcal{M}(\mathrm{SA})$. To see this, consider a simplicial complex which contains a maximal triangle and a
maximal tetrahedron, sharing an edge. We could label the triangle and tetrahedron as 1-2-3 and
1-2-4-5, or as 1-3-4 and 2-3-4-5 respectively. Note that the two labelings give two $\mathcal{M}(\mathrm{SA})$ (and
MxST) of different sizes.
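
For MxST, the difference can be verified directly. The Python sketch below builds the Maximal Simplex Tree as a trie of nested dictionaries and counts its nodes, excluding the root. The helper names are ours, and the representation is a convenience for illustration rather than the layout used in the paper.

```python
def build_mxst(maximal_simplices):
    """Trie (nested dicts) over the maximal simplices, each read as a sorted word."""
    root = {}
    for simplex in maximal_simplices:
        node = root
        for v in sorted(simplex):
            node = node.setdefault(v, {})
    return root


def count_nodes(trie):
    """Number of nodes strictly below the given trie node (the root is not counted)."""
    return sum(1 + count_nodes(child) for child in trie.values())


# The two labelings of a triangle and a tetrahedron sharing an edge.
LABELING_A = [(1, 2, 3), (1, 2, 4, 5)]
LABELING_B = [(1, 3, 4), (2, 3, 4, 5)]
print(count_nodes(build_mxst(LABELING_A)))  # 5
print(count_nodes(build_mxst(LABELING_B)))  # 7
```

In the first labeling the shared edge 1-2 is a common prefix of both words, so its two nodes are stored once. In the second labeling the shared edge 3-4 is not a prefix of either word and nothing is shared.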

In all of the hardness results we will see, the reduction will be from vertex cover for some
specific graphs, where graphs are considered as 1-dimensional simplicial complexes. More formally,
given a graph $G$ on vertex set $V$ and edge set $E$, the associated 1-dimensional simplicial complex
is defined on the vertex set $V$ and admits all edges in $E$ as 1-dimensional simplices. When there
is no confusion, we will refer to this simplicial complex also as $G$.

We observe that MxST($G$) is of height 2. In layer 0 we have the root, and in layer 2 we have
all the leaves (we may have some leaves in layer 1 if there are vertices in $G$ of degree zero). Since
the number of leaves in the MxST is equal to the number of maximal simplices in $G$, which is in
turn equal to the number of edges in $G$ (plus the number of zero-degree vertices), regardless of the
ordering on vertices in $G$, the numbers of nodes in layers 0 and 2 are fixed to 1 and $|E|$ respectively.
Further note that the vertices on layer 1 form a vertex cover of $G$.

We will similarly analyze C(ST($G$)). ST($G$) can be obtained from MxST($G$) by introducing
nodes on layer 1 (and edges from the root to these nodes) such that all vertices appear in nodes
of layer 1 of ST($G$). This means that in ST($G$), we have the root in layer 0, $|V|$ nodes on layer 1,
and $|E|$ nodes on layer 2. We recapitulate here that the size of ST is invariant over the ordering of
vertices (as can be observed above: $|\mathrm{ST}(G)| = |V| + |E|$). More importantly, for 1-dimensional
simplicial complexes we have that |C(ST)| = |ST|. Therefore, to analyze the impact of labeling on
the size of C(ST), we will have to move to higher dimensional complexes (which we indeed do).

7.1 Optimal Labeling for the Maximal Simplex Tree 

The size of MxST is very sensitive to the labeling of the vertices. For instance, in Example 5, by 
reversing the labeling of vertices, we increase the size of MxST(K) by a factor of order k. 

Example 5. Let $K \in \mathcal{K}(d+k, k, d, m)$ be the complex whose set of maximal simplices is 
$\{\{1, 2, \ldots, d, d+i\} \mid 1 \le i \le k\}$. 
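
A short calculation makes the factor explicit. It is a sketch under two assumptions that are not stated in the example itself: the size of MxST counts its non-root nodes, and "reversing" means that vertex $j$ receives label $d + k + 1 - j$. With the labeling of Example 5 all $k$ maximal simplices share the prefix $1, \ldots, d$; after reversal each maximal simplex starts with a distinct label, so no two paths share a node.

```latex
\begin{align*}
|\mathrm{MxST}(K)| &= d + k, \\
|\mathrm{MxST}(K_{\mathrm{rev}})| &= k\,(d + 1), \\
\frac{|\mathrm{MxST}(K_{\mathrm{rev}})|}{|\mathrm{MxST}(K)|}
  &= \frac{k\,(d+1)}{d+k} \;\ge\; \frac{k}{2}
  \qquad \text{whenever } d \ge k .
\end{align*}
```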


First, we formalize the label ordering problem on MxST: given an integer $\alpha$ and a simplicial 
complex $K_\theta \in \mathcal{K}_\theta(n,k,d,m)$, does there exist a permutation $\pi$ of $\{1, 2, \ldots, n\}$ such that 
$|\mathrm{MxST}(K_{\pi\circ\theta})| \le \alpha$? Let us refer to this problem as MxSTMINIMIZATION($K_\theta$, $\alpha$). From Theorem 
17 of [BM16], we know that it is NP-Complete even for 1-dimensional complexes. 

Theorem 2 ([BM16]). MxSTMINIMIZATION is NP-Complete. 

Proof. Clearly MxSTMINIMIZATION is in NP. We will now show that it is NP-Hard as well. Given 
a connected graph G on vertex set V (with $\theta$ being a labeling of the vertices from 1 to $|V|$) and 
edge set E, we define $K_\theta$ to be the 1-dimensional simplicial complex associated to G. Since G is 
connected, all the leaves in $\mathrm{MxST}(K_\theta)$ appear in layer 2. Thus finding a vertex cover 
for G of size at most $\alpha$ is equivalent to finding an ordering $\pi$ on the vertices in $K_\theta$ (to determine 
which of these should appear in nodes in layer 1 of the MxST) such that $|\mathrm{MxST}(K_{\pi\circ\theta})| \le \alpha + |E|$. 
Since the vertex cover problem is NP-Hard [K72], MxSTMINIMIZATION is also NP-Hard. 
Thus MxSTMINIMIZATION is NP-Complete. □ 
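
The equivalence used in the proof can be checked by brute force on a tiny graph. The sketch below assumes that the size of MxST counts non-root nodes, so that for a graph without isolated vertices it equals the number of distinct lower-labeled endpoints (layer 1) plus $|E|$ (layer 2). All function names are hypothetical, and the exhaustive search is only meant for very small inputs.

```python
from itertools import combinations, permutations


def mxst_size(edges, label):
    """Layer 1 holds the lower-labeled endpoint of every edge; layer 2 holds one leaf per edge."""
    layer1 = {min(label[u], label[v]) for u, v in edges}
    return len(layer1) + len(edges)


def min_vertex_cover(vertices, edges):
    """Smallest vertex cover size, by exhaustive search over subsets."""
    for size in range(len(vertices) + 1):
        for cover in combinations(vertices, size):
            chosen = set(cover)
            if all(u in chosen or v in chosen for u, v in edges):
                return size


def min_mxst_size(vertices, edges):
    """Smallest MxST size over all labelings, by exhaustive search over permutations."""
    n = len(vertices)
    return min(mxst_size(edges, dict(zip(vertices, perm)))
               for perm in permutations(range(1, n + 1)))


# A path on four vertices: a connected graph with |E| = 3.
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(min_vertex_cover(vertices, edges), min_mxst_size(vertices, edges))
```

On this path the best labeling places the two cover vertices b and c first, which gives a minimum size of 2 + 3 = 5.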

Intuitively, finding a good labeling for C(MxST) seems harder than for MxST. More formally, 
given an integer $\alpha$ and a simplicial complex $K_\theta \in \mathcal{K}_\theta(n,k,d,m)$, does there exist a 
permutation $\pi$ of $\{1, 2, \ldots, n\}$ such that $|C(\mathrm{MxST}(K_{\pi\circ\theta}))| \le \alpha$? Let us refer to this problem as 
CMxSTMINIMIZATION($K_\theta$, $\alpha$). Corollary 1 easily follows from Theorem 2 as, for any 1-dimensional 
simplicial complex and any fixed labeling, $|C(\mathrm{MxST})| = |\mathrm{MxST}|$. 

Corollary 1. CMxSTMINIMIZATION is NP-Complete. 


Proof. It is clear that CMxSTMINIMIZATION is in NP. We know from the proof of Theorem 2 
that it is NP-Hard to decide MxSTMINIMIZATION even for pure simplicial complexes of dimension 1. 
We will thus show a reduction from MxSTMINIMIZATION for such simplicial complexes to 
CMxSTMINIMIZATION. For any pure simplicial complex K of dimension 1 under any labeling $\theta$ of 
its vertices, the sizes of MxST and C(MxST) for $K_\theta$ are the same. Thus any solution for an instance 
of MxSTMINIMIZATION is also a solution for the same instance of CMxSTMINIMIZATION and vice 
versa. □ 

7.2 Optimal Labeling for the Minimal Simplex Automaton 

To prove that finding a good labeling for M(SA) is hard, we use a reduction from another instance 
of vertex cover for a special class of graphs. We say a graph G is square-free if, for every two 
vertices u, v in G, the number of common neighbors in G is at most 1. 
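
The definition translates directly into a check over all pairs of vertices. The following is a minimal sketch; the function name and the edge-list input format are choices made here for illustration.

```python
from itertools import combinations


def is_square_free(vertices, edges):
    """True if every two distinct vertices have at most one common neighbor."""
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return all(len(neighbors[u] & neighbors[v]) <= 1
               for u, v in combinations(vertices, 2))


cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
cycle5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(is_square_free(range(4), cycle4), is_square_free(range(5), cycle5))
```

Two opposite vertices of a 4-cycle share two neighbors, so the 4-cycle fails the test, while the 5-cycle and the triangle pass it.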

Lemma 8. Vertex Cover problem on square-free graphs is NP-Hard. 

Proof. We observe that in the reduction from SAT to 3-SAT (Theorem 3.1 of [GJ79]), every clause 
has exactly three distinct variables¹. Next, we observe that if every clause has exactly three distinct 
variables then, in the reduction from 3-SAT to Vertex Cover (Theorem 3.3 of [GJ79]), the graph 
constructed does not have a cycle of length four. □ 

We will now formalize the decision problem for M(SA). Given an integer $\alpha$ and $K_\theta \in 
\mathcal{K}_\theta(n,k,d,m)$, does there exist a permutation $\pi$ of $\{1, 2, \ldots, n\}$ such that $|\mathrm{M}(\mathrm{SA}(K_{\pi\circ\theta}))| \le \alpha$? Let us 
refer to this problem as MSAMINIMIZATION($K_\theta$, $\alpha$). We work with square-free graphs because, by 
using such graphs for building simplicial complexes, we avoid scenarios in which 
two states in M(SA) have identical sets of outgoing transitions but different sets of incoming 
transitions. 

Theorem 3. MSAMINIMIZATION is NP-Complete. 

Proof. It is clear that MSAMINIMIZATION is in NP. We will now show that it is NP-Hard as well. 
Given a square-free graph G on vertex set V and edge set E, and a labeling $\theta$ from 1 to $|V|$ on 
the vertices, we define $K_\theta$ to be the 1-dimensional simplicial complex associated to G. In the 
following paragraphs we will prove that G has a vertex cover of size at most $\alpha$ if and only if there 
exists an ordering $\pi$ on the vertices of $K_\theta$ such that $|\mathrm{M}(\mathrm{SA}(K_{\pi\circ\theta}))| \le \alpha + 2$. 

First, we prove the forward direction. We define the height of a state as the length of the longest 
sequence of transitions from the initial state to that state. We define the height of an automaton as 
the maximum over the heights of all states in the automaton. Thus $\mathrm{M}(\mathrm{SA}(K_\theta))$ is of height 2. At 
height 0 we have the initial state, and at height 2 we have a single state. Transitions from states at 
height 1 to height 2 correspond to the edges of G regardless of the ordering on the vertices in $K_\theta$. 
Note that vertices at height 1 form a vertex cover of G. Thus if G has a vertex cover of size at most 
$\alpha$ then we can construct an ordering $\pi$ on the vertices of $K_\theta$ (by allowing all vertices appearing in 
the vertex cover to appear before the remaining vertices) such that $|\mathrm{M}(\mathrm{SA}(K_{\pi\circ\theta}))| \le \alpha + 2$. 

¹ Here $x$ and $\neg x$ are considered as the same variable. 

Now, we prove the reverse direction. Suppose there exists an ordering $\pi$ on the vertices of $K_\theta$ 
such that $|\mathrm{M}(\mathrm{SA}(K_{\pi\circ\theta}))| \le \alpha + 2$. If there is a state s in $\mathrm{M}(\mathrm{SA}(K_{\pi\circ\theta}))$ at height 1 which has more 
than one incoming transition then it cannot have more than one outgoing transition, because G is 
square-free. In this case, the outgoing transition of s is rewired to be the incoming transition from 
the initial state and the incoming transitions are rewired to be the outgoing transitions from s to 
the single state at height 2. Once this swapping (i.e., rewiring) is done for each such state at height 1, we 
have ensured that the number of incoming transitions for such states is one (because the number 
of outgoing transitions from these states before rewiring was one). Additionally, we note that by 
performing the above rewiring in $\mathrm{M}(\mathrm{SA}(K_{\pi\circ\theta}))$ we have not introduced (or removed) any 
states, and thus the size of $\mathrm{M}(\mathrm{SA}(K_{\pi\circ\theta}))$ has not changed (and it is computing the same language 
as before). Therefore, choosing the set of all labels of outgoing transitions from the initial state to 
states at height 1 gives a subset of the vertex set of size at most $\alpha$ which does indeed form a vertex 
cover of G. From Lemma 8 we know that vertex cover is NP-Hard for square-free graphs, which 
implies that MSAMINIMIZATION is also NP-Hard. Thus MSAMINIMIZATION is NP-Complete. □ 
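
The statement can be checked by brute force on small graphs. The sketch below rests on an assumption read off from the proof rather than on a definition given in this section: for a graph with at least one edge, the states of the minimal automaton are the initial state, the single state at height 2, and one state per distinct non-empty set of higher-labeled neighbors (states with identical outgoing transitions are merged, every state is accepting, and no sink state is counted). Function names are hypothetical.

```python
from itertools import permutations


def msa_states(vertices, edges, label):
    """State count of the minimal automaton of a graph under a labeling (see assumptions)."""
    upper = {v: set() for v in vertices}
    for u, v in edges:
        low, high = sorted((u, v), key=label.get)
        upper[low].add(label[high])          # transition labels leaving the state of `low`
    distinct = {frozenset(s) for s in upper.values() if s}
    return 2 + len(distinct)                 # initial state + final state + merged height-1 states


def min_msa_states(vertices, edges):
    """Minimum state count over all labelings, by exhaustive search."""
    n = len(vertices)
    return min(msa_states(vertices, edges, dict(zip(vertices, perm)))
               for perm in permutations(range(1, n + 1)))


path4 = [("a", "b"), ("b", "c"), ("c", "d")]                 # square-free, min vertex cover 2
cycle4 = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]    # not square-free, min vertex cover 2
print(min_msa_states("abcd", path4), min_msa_states("abcd", cycle4))
```

On the path the minimum is 4, which is the minimum vertex cover plus 2. On the 4-cycle the minimum drops to 3, because two opposite vertices can be given identical outgoing transitions and are merged; this is the kind of scenario that the restriction to square-free graphs excludes.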

We believe that to obtain reasonably small MxST and M(SA), one has to label the vertices 
in decreasing order of $k_v \stackrel{\mathrm{def}}{=}$ number of maximal simplices containing vertex $v$. Thus, in practice 
one may use this heuristic to find a good labeling. 
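
A minimal sketch of this heuristic follows. It assumes that ties between vertices with equal $k_v$ are broken by the old label (the text does not specify a rule) and that the size of MxST counts non-root trie nodes; all names are illustrative.

```python
from collections import Counter


def heuristic_labeling(maximal_simplices):
    """Map each vertex to a new label, in decreasing order of k_v (ties: old label)."""
    k = Counter(v for simplex in maximal_simplices for v in simplex)
    ranked = sorted(k, key=lambda v: (-k[v], v))
    return {v: i + 1 for i, v in enumerate(ranked)}


def mxst_size(maximal_simplices, label):
    """Non-root nodes of the trie of maximal simplices sorted by the given labels."""
    root, size = {}, 0
    for simplex in maximal_simplices:
        node = root
        for x in sorted(label[v] for v in simplex):
            if x not in node:
                node[x] = {}
                size += 1
            node = node[x]
    return size


# Example 5 with d = 4, k = 3 after reversing the labels: a bad starting labeling.
K = [{3, 4, 5, 6, 7}, {2, 4, 5, 6, 7}, {1, 4, 5, 6, 7}]
identity = {v: v for v in range(1, 8)}
before = mxst_size(K, identity)
after = mxst_size(K, heuristic_labeling(K))
print(before, after)
```

On this input the heuristic moves the four vertices shared by all maximal simplices to the front, so the three paths share a common prefix again.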

7.3 Optimal Labeling for the Compressed Simplex Tree 

We had observed in Section 3.2 the delicate relationship between the sizes of M(SA) and C(ST). 
Additionally, the complexity measures used for the sizes of M(SA) and C(ST) are not coherent, 
and thus we will not be able to use Theorem 3 here to prove a hardness result. On the other hand, we 
cannot (trivially) extend the reduction from MxSTMINIMIZATION to that of finding good labelings 
for C(ST) because, as stated earlier, for 1-dimensional simplicial complexes we have $|C(\mathrm{ST})| = |\mathrm{ST}|$. 
Instead we append the simplicial complex (associated to) G and construct 2-dimensional complexes, 
which we then use to prove a hardness result for finding good labelings for C(ST). 

Let us first formalize the decision problem. Given an integer $\alpha$ and a simplicial complex 
$K_\theta \in \mathcal{K}_\theta(n,k,d,m)$, does there exist a permutation $\pi$ of $\{1, 2, \ldots, n\}$ such that $|C(\mathrm{ST}(K_{\pi\circ\theta}))| \le 
\alpha$? Let us refer to this problem as CSTMINIMIZATION($K_\theta$, $\alpha$). To prove the hardness result for 
CSTMINIMIZATION, we provide a reduction from a special instance of the vertex cover problem which 
will be shown to be NP-hard. Specifically, we restrict the vertex cover problem to a special class of 
graphs which we call King-Maker graphs. A graph G is called a King-Maker graph if there exist 
two vertices u, v in G such that u is connected to all other vertices in G (and thus we shall fondly refer 
to this vertex as the king) and v is of degree 1. 

Lemma 9. Vertex Cover problem on King-Maker graphs is NP-Hard. 

Proof. Given a graph G on vertex set V and edge set E, we build a King-Maker graph G' by 
adding two new vertices u and v, and adding an edge between every vertex in G and u, and an 
edge between u and v. If G has a vertex cover of size $\alpha$ then G' has a vertex cover of size $\alpha + 1$ 
and vice versa. □ 
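
The construction in the proof is easy to reproduce and test on a small graph. In the sketch below the names `king_maker`, `king` and `pawn` are introduced for illustration, the two new vertex names are assumed not to occur in G already, and the vertex cover is found by exhaustive search.

```python
from itertools import combinations


def king_maker(vertices, edges, king="king", pawn="pawn"):
    """Add a king adjacent to every vertex of G and a degree-1 vertex attached to the king."""
    new_vertices = list(vertices) + [king, pawn]
    new_edges = list(edges) + [(king, x) for x in vertices] + [(king, pawn)]
    return new_vertices, new_edges


def min_vertex_cover(vertices, edges):
    """Smallest vertex cover size, by exhaustive search over subsets."""
    for size in range(len(vertices) + 1):
        for cover in combinations(vertices, size):
            chosen = set(cover)
            if all(u in chosen or v in chosen for u, v in edges):
                return size


triangle_vertices = [1, 2, 3]
triangle_edges = [(1, 2), (2, 3), (1, 3)]
g_vertices, g_edges = king_maker(triangle_vertices, triangle_edges)
print(min_vertex_cover(triangle_vertices, triangle_edges),
      min_vertex_cover(g_vertices, g_edges))
```

For the triangle the minimum cover grows from 2 to 3: the king covers all new edges, and leaving it out would force every other vertex into the cover.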

We are now equipped to prove NP-Hardness of CSTMINIMIZATION. We follow a reduction 
similar to that described in the proof of Theorem 2; reducing from King-Maker graphs helps 
us ensure that (i) there exists a vertex cover of smallest size which contains the king, and (ii) every 
vertex of the graph (except the king) appears in nodes of layer 2 of C(ST). Also, we append the 
simplicial complex (associated to) G by introducing a new vertex and extend G through insertion 
of triangles. 

Theorem 4. CSTMINIMIZATION is NP-Complete. 

Proof. It is clear that CSTMINIMIZATION is in NP. We will now show that it is NP-Hard as well. 
Let G be a King-Maker graph on vertex set V and edge set E, and $\theta$ be a labeling of the vertices 
from 1 to $|V|$. We define $K_\theta$ to be the 1-dimensional simplicial complex associated to G. We then 
introduce a new vertex v in $K_\theta$ with label $|V| + 1$ and form maximal triangles, each consisting of 
v together with a maximal edge of $K_\theta$. We denote this appended $K_\theta$ by $K'_\theta$. Thus, we still have 
the same number of maximal simplices but the dimension has increased by 1 and $C(\mathrm{ST}(K'_\theta))$ is 
of height 3 (see Figure 14 for a demonstration). The rest of the proof will focus on proving the 
following claim: G has a vertex cover of size at most $\alpha$ if and only if there exists a permutation $\pi^+$ 
of the vertices of $K'_\theta$ such that $|C(\mathrm{ST}(K'_{\pi^+\circ\theta}))|$ is at most $2|V| + |E| + \alpha$. 


Figure 14: Demonstration of the construction of K' through an example (the figure marks a vertex cover of G). 

Consider a vertex cover $V'$ of G of size at most $\alpha$. We construct a permutation $\pi$ of the labels of 
the vertices of $K_\theta$ so that all vertices in $V'$ appear before any other vertex in V. Further, we assume 
that the king is in $V'$. This is no loss of generality since, if a vertex cover excludes the king, we can 
always find another vertex cover of the same size which includes the king (the degree-1 vertex must 
then be in the cover and can be exchanged for the king). Further, we set the label 
of the king under $\pi$ to 1. In $C(\mathrm{ST}(K_{\pi\circ\theta}))$, we distinguish three layers. Layer 1 consists of the root, 
the vertices of $V'$ appear in layer 2, and all vertices but the king appear in layer 3. There are $|V'|$ 
edges from layer 1 to layer 2, $|E|$ edges from layer 2 to layer 3 and $|V| - |V'|$ edges from layer 1 to 
layer 3. Now, we extend $\pi$ to construct a permutation $\pi^+$ of labels of the vertices of $K'_\theta$ by setting 
$\pi^+(|V| + 1) = |V| + 1$ and keeping the same labels as in $\pi$ for the remaining vertices. We can see 
$C(\mathrm{ST}(K'_{\pi^+\circ\theta}))$ as adding one new node and some edges to $C(\mathrm{ST}(K_{\pi\circ\theta}))$. The node with label $|V| + 1$ is 
put on layer 4 and we introduce an edge from all vertices in layers 1, 2 and 3 to this node in layer 4. 
Thus we have $|C(\mathrm{ST}(K'_{\pi^+\circ\theta}))| = |C(\mathrm{ST}(K_{\pi\circ\theta}))| + 1 + |V'| + (|V| - 1) = 2|V| + |E| + |V'| \le 2|V| + |E| + \alpha$. 

To prove the other direction of the equivalence, suppose the vertices of $K'_\theta$ can 
be labeled under a permutation $\pi^+$ such that the size of C(ST) is at most $2|V| + |E| + \alpha$; we claim that we 
can then find a vertex cover of G of size at most $\alpha$. In order to find this vertex cover, we observe that 
the structure of an optimal C(ST) under $\pi^+$ will have only one node in layer 4 and this will contain 
the vertex v, i.e., the new vertex introduced in $K_\theta$ with label $|V| + 1$. This is formally stated below 
as Claim 1. Thus, the structure of C(ST) is the one we had previously encountered and we can extract 
the vertex cover of G by looking at the nodes on layer 2 (layer 2 can have at most $\alpha$ nodes since we 
will observe later that in an optimal C(ST), the king will appear with label 1 under $\pi^+$). We will 
now see that under any permutation, if the label of v is not in the last position then we can find a 
new permutation which leads to a C(ST) of no larger size with the label of v being at the last position. 

Claim 1. Given a permutation $\pi^+$ such that $\pi^+(|V| + 1) \ne |V| + 1$, we can construct a new 
permutation $\rho^+$, obtained by moving $|V| + 1$ to the last position, such that²: 

$|C(\mathrm{ST}(K'_{\pi^+\circ\theta}))| \ge |C(\mathrm{ST}(K'_{\rho^+\circ\theta}))|$. 

Thus, if we have a permutation $\pi^+$ such that $|C(\mathrm{ST}(K'_{\pi^+\circ\theta}))| \le 2|V| + |E| + \alpha$, we may assume 
that $\pi^+(|V| + 1) = |V| + 1$ and we can select the vertex cover of G by considering the vertices on 
layer 2 of $C(\mathrm{ST}(K'_{\pi^+\circ\theta}))$. Next, we argue that in $\pi^+$, we may assume $\pi^+(1) = 1$ (i.e., the king gets 
label 1 under $\pi^+$), since otherwise we can always move it to the first position and this does not affect 
the size of C(ST). It is now easy to see that the size of layer 2 is at most $\alpha$, as the king is on layer 2. 


We know from Lemma 9 that the vertex cover problem for King-Maker graphs is NP-Hard, 
and this implies that CSTMINIMIZATION is also NP-Hard. Thus, we have that CSTMINIMIZATION 
is NP-Complete. □ 

Proof of Claim 1. Let $C(\mathrm{ST}(K'_{\pi^+\circ\theta})) \setminus v$ denote the graph obtained if we delete all the nodes (and 
all entering or exiting edges incident on these nodes) containing the label of v in $C(\mathrm{ST}(K'_{\pi^+\circ\theta}))$. 
Note that $C(\mathrm{ST}(K'_{\pi^+\circ\theta})) \setminus v$ is exactly the same as $C(\mathrm{ST}(K_{\pi\circ\theta}))$, whose size was invariant over $\pi$ 
(because K is a 1-dimensional simplicial complex). Therefore, it suffices to show that the sum 
of the number of edges entering or leaving all nodes containing the label of v in $C(\mathrm{ST}(K'_{\pi^+\circ\theta}))$ is at 
least the sum of the number of edges entering or leaving all nodes containing the label of v in 
$C(\mathrm{ST}(K'_{\rho^+\circ\theta}))$. Let L (resp., R) denote the set of all vertices in G whose labels under $\pi^+\circ\theta$ appear 
before (resp., after) $\pi^+(\theta(v))$. Let $\Delta$ be the number of edges between L and R in G. Then, we 
claim that $|C(\mathrm{ST}(K'_{\pi^+\circ\theta}))| - |C(\mathrm{ST}(K'_{\rho^+\circ\theta}))| = \Delta \ge 0$. To see why the claim is true, let us 
analyze the position of the nodes containing the label of v in $C(\mathrm{ST}(K'_{\pi^+\circ\theta}))$. Write $L_v$ for the label of 
vertex v, i.e., $L_v = \pi^+(|V| + 1)$. If there is an edge between two nodes in R then we will have a 
node with label $L_v$ in layer 2. If there is an edge between two nodes in L then we will have a node 
with label $L_v$ in layer 4. Finally, if there is an edge between a node in R and a node in L then we 
will have a node with label $L_v$ in layer 3. By moving the position of v to the last position, we are 
getting rid of the appearance of nodes with label $L_v$ from layers 2 and 3. Now, the king is either 
in L or in R, and in both cases, we know that there are edges from the king to every other node 
in $L \cup R$. Therefore, the disappearance of nodes in layer 3 reduces the number of edges by $\Delta$, 
because all edges between nodes in L in layer 2 and the node containing label $L_v$ in layer 4, and all 
edges between nodes in R in layer 2 and the node containing label $L_v$ in layer 4, were already existing 
in $C(\mathrm{ST}(K'_{\pi^+\circ\theta}))$. □ 

² Here we see $\pi^+$ and $\rho^+$ as strings of length $|V| + 1$ obtained by their action on $1, 2, \ldots, |V| + 1$, and by 'moving' 
we mean to remove an element from its current position and insert it in the new position. 


8 Discussion and Conclusion 


In this paper, we introduced a compression technique for the Simplex Tree without compromising 
on functionality. Additionally, we have proposed two new data structures for simplicial complexes 
- the Maximal Simplex Tree and the Simplex Array List. We observed that the Minimal Simplex 
Automaton is generally smaller than the Simplex Automaton. Further, we showed that the Maximal 
Simplex Tree is compact and that the Simplex Array List is efficient (and compact when d is fixed). 
This is summarized in Table 4. 

The transitive closure of MxST may have a node with as many as kd outgoing edges to 
neighbors containing the same label. SAL reduces the number of outgoing edges to such neighbors 
with the same label from kd to d, making it much more powerful. In short, it reduces the 
non-determinism of their equivalent automaton representation. Also, in most complexes observed in 
practice, k is a low-degree polynomial in n. Example 4 and Lemma 4 both deal with complexes 
where k is small. Further, all hardness results in Section 7 are for complexes of dimension at 
most 2. Thus, complexes where either k or d is small are interesting to study and for these cases, 
SAL is very efficient. 

Marc Glisse and Sivaprasad implemented SAL [MS15] for Data Set 1 on some values of r, and 
then performed insertion and removal of random simplices, and contracted randomly chosen edges. 
They made the following observations: 

• In all their experiments ($n \approx 10000$, $r \in [0.2, 0.5]$), the size of 1-SAL was significantly smaller 
than the size of ST, and in most cases ST would run out of memory. 

• They found that 1-SAL outperformed 0-SAL in low dimensions. However, 0-SAL performed 
better than 1-SAL in higher dimensions. 

• They modified 1-SAL by storing fewer edges and this was noted to save a factor of two in size 
(in dimension 30), over the 1-SAL proposed in this paper. 

Therefore, it would be worth exploring for which class of simplicial complexes i-SAL is the best 
data structure in the SAL family (for every $i \in \mathbb{N}$). 

Trie compression techniques, like that of M(SA), are efficient when the trie is assumed to be 
static. However, over the last decade, this has been extended using Dynamic Minimization - the 
process of maintaining an automaton minimal when insertions or deletions are performed. This has 
been well studied in [CF02] and [SFK03], and extended to acyclic automata in [DMWW00], which 
would be of particular interest to us. Interestingly, it appears that in all of the works above, the 
finiteness of the language plays no special role and, for the specific case of SA, results may be made 
sharper. 

Another direction is to look at approximate data structures for simplicial complexes, i.e., we 
store almost all the simplices (introducing an error) and gain efficiency in compression (i.e., little 
storage). This is a well explored topic in automata theory called hyperminimization [M11] and, since 
our language is finite, k-minimization [BGS09] and cover automata [CSY01] might give efficient 
approximate data structures by hyperminimizing SA. We motivate this with the help of building 
complexes from a random sample of a large data set. By sampling we are bound to lose information 
and, instead of taking a random sample, we can look at constructing hyperminimized SA over the 
entire data set dynamically. It will be interesting to know if the power of randomness can overcome 
‘smart’ approximations. 

Theorem 3 can be generalized to give the following hardness result. Given a word w on alphabet 
set $\Sigma$ and a lexicographic ordering $\theta$ on $\Sigma$, let $C_\theta$ be an operation which removes all duplicate letters in 
the word and rearranges the letters of the word in the lexicographic order given by $\theta$. Given a language 
L on alphabet set $\Sigma$, we define $C_\theta(L) = \{C_\theta(w) \mid w \in L\}$. With these definitions, we give 
the following general result: 

Theorem 5. Given a finite language L (explicitly through the words in L) on alphabet set $\Sigma$ and 
an integer x, it is NP-Hard to decide if there exists an ordering $\theta$ on $\Sigma$ such that the size of the 
smallest DFA recognizing $C_\theta(L)$ is less than x. 
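
The operation $C_\theta$ is simple to state in code. The sketch below represents the ordering $\theta$ as a string listing the alphabet from smallest to largest letter, which is a representation chosen here for illustration; the function names are likewise hypothetical.

```python
def canonical(word, order):
    """Remove duplicate letters from `word` and sort the rest by their position in `order`."""
    rank = {letter: i for i, letter in enumerate(order)}
    return "".join(sorted(set(word), key=rank.__getitem__))


def canonical_language(language, order):
    """Apply the operation to every word of a finite language."""
    return {canonical(word, order) for word in language}


print(canonical("banana", "abn"), canonical("banana", "nba"))
print(canonical_language({"ab", "ba", "aab"}, "ab"))
```

Different orderings give different canonical words for the same input, which is why the size of the smallest DFA for the canonical language depends on the ordering.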

Theorems 4 and 5 provide a new dimension to the hardness results obtained by Comer and 
Sethi in [CS76]. It would be worth exploring this direction further. Also, it would be interesting 
to find approximation algorithms for MSAMINIMIZATION. 

Performance of i-SAL depends on the value of Tj - are there any interesting bounds on Tj 
for some subset of nice simplicial complexes of $\mathcal{K}(n,k,d,m)$? Finally, proving better bounds on the 
extent of compression remains an open problem, and maybe geometric constraints will eliminate 
pathological examples which hinder proving good bounds on compression. 

A Adapted Hopcroft’s Algorithm 


We precisely describe here Hopcroft’s algorithm adapted to the compression of ST to provide as 
output C(ST). The idea is to write the Simplex Tree as the Simplex Automaton: each vertex of the 
tree becomes a state in the automaton. Then, one just needs to reduce the Simplex Automaton 
using Hopcroft’s algorithm and finally obtain the compressed Simplex Tree by partitioning the 
states of the Minimal Simplex Automaton. We see in Figure 4, the compressed Simplex Tree of 
the simplicial complex described in Figure 1. 

Algorithm 1 Hopcroft’s Algorithm for compression of Simplex Tree 

Input: A Simplex Tree $\mathcal{T}$. Let $L$ be the set of leaves, $N$ be the set of internal nodes and $V$ be the 
set of the vertices of the simplicial complex. 

Output: A compressed Simplex Tree $\mathcal{T}'$. 

 1: $\mathcal{S} \leftarrow \emptyset$ 
 2: $\mathcal{P} \leftarrow \{L, N\}$ 
 3: for $v \in V$ do 
 4:     Add $(L, v)$ in $\mathcal{S}$. 
 5: end for 
 6: while $\mathcal{S} \ne \emptyset$ do 
 7:     Pop one element $(C, v)$ in $\mathcal{S}$. 
 8:     for each $B \in \mathcal{P}$ such that there exist $n_1$ and $n_2$ in $B$ such that there is an edge in $\mathcal{T}$ from $n_1$ to a node of $C$ labelled by $v$, and it is not the case for $n_2$ do 
 9:         $B' \leftarrow \{n \in B \mid$ there is an edge from $n$ to a node in $C$ labelled by $v\}$ 
10:         $B'' \leftarrow \{n \in B \mid$ there is no edge from $n$ to a node in $C$ labelled by $v\}$ 
11:         Replace $B$ by $B'$ and $B''$ in $\mathcal{P}$. 
12:         for $w \in V$ do 
13:             if $(B, w) \in \mathcal{S}$ then 
14:                 Replace $(B, w)$ by $(B', w)$ and $(B'', w)$ in $\mathcal{S}$. 
15:             else 
16:                 $D \leftarrow B'$ if $|B'| \le |B''|$, and $D \leftarrow B''$ otherwise. 
17:                 Add $(D, w)$ in $\mathcal{S}$. 
18:             end if 
19:         end for 
20:     end for 
21: end while 
22: Now we just have to build the compressed Simplex Tree $\mathcal{T}'$. 
23: Let $S$ be the part in $\mathcal{P}$ which contains the root of $\mathcal{T}$. 
24: $\mathcal{T}' \leftarrow$ initial state: $(S, *)$ labeled $*$. 
25: $\mathcal{W} \leftarrow \{(S, *)\}$ 
26: while $\mathcal{W} \ne \emptyset$ do 
27:     Pop an element $(B, w)$ in $\mathcal{W}$. 
28:     for $v \in V$ do 
29:         Let $b$ be an element in $B$.   /* It will be a representative of $B$. */ 
30:         Let $a$ be (if it exists) the node in $\mathcal{T}$ reached from $b$ by reading $v$. 
31:         Let $A$ be the part in $\mathcal{P}$ which contains $a$. 
32:         if $(A, v)$ is not already in $\mathcal{T}'$ then 
33:             Add a vertex $(A, v)$ in $\mathcal{T}'$. 
34:             Push $(A, v)$ in $\mathcal{W}$. 
35:         end if 
36:         Add an edge from $(B, w)$ to $(A, v)$ in $\mathcal{T}'$. 
37:     end for 
38: end while 
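
Because the Simplex Automaton is tree-shaped and acyclic, the number of states of its minimal automaton can also be obtained without partition refinement. The sketch below is not Hopcroft's algorithm: it swaps in a bottom-up merge of nodes with identical (label, child class) sets, which yields the same equivalence classes on a trie when every state is accepting (an assumption made here, since every node of ST is a simplex). It only counts the states and does not build the compressed tree of steps 22-38; all names are illustrative.

```python
from itertools import combinations


def simplex_tree(maximal_simplices):
    """Trie containing every non-empty face of the given maximal simplices."""
    root = {}
    for simplex in maximal_simplices:
        vertices = sorted(simplex)
        for size in range(1, len(vertices) + 1):
            for face in combinations(vertices, size):
                node = root
                for label in face:
                    node = node.setdefault(label, {})
    return root


def node_count(tree):
    """Number of nodes of the trie, root included."""
    return 1 + sum(node_count(child) for child in tree.values())


def minimal_state_count(tree):
    """Number of classes of nodes with identical subtrees (states of the minimal automaton)."""
    classes = {}

    def class_of(node):
        signature = tuple(sorted((label, class_of(child))
                                 for label, child in node.items()))
        return classes.setdefault(signature, len(classes))

    class_of(tree)
    return len(classes)


triangle = simplex_tree([{1, 2, 3}])
print(node_count(triangle), minimal_state_count(triangle))
```

For a single triangle the Simplex Tree has 8 nodes (root included) while the merged automaton has 4 states, one for each value of the largest label read so far.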


B Operations on the Maximal Simplex Tree 


We provide below a list of basic operations on the MxST. In the following subsections we denote by 
$T_{\mathrm{MxST}}$ the maximum time taken to search for a particular child of any node in MxST ($T_{\mathrm{MxST}} = 
O(\log n)$ for red-black trees). 

B.1 Identifying the Maximal Cofaces of a Simplex 

As seen in Section 4, a simplex $\sigma \in K$ is implicitly stored in MxST(K) as a face of its (at most k) 
maximal cofaces. Identifying the maximal cofaces of a simplex means finding the leaves of MxST(K) 
that contain the maximal cofaces of $\sigma$ and also finding, on each path from the root to such a leaf, 
the location of the vertices of $\sigma$. 

We formalize this by first defining a bijection $f$ from the set of all maximal simplices of K to 
the set of all leaves in MxST(K), and also define a function $g$ that maps every simplex $\sigma \in K$ to the 
set of all maximal simplices in K which contain $\sigma$ (thus if $g(\sigma) = \emptyset$ then $\sigma \notin K$). For simplicity, we 
will denote $f(g(\sigma))$ by $L(\sigma)$ and $|L(\sigma)|$ by $k_\sigma$ from now on. Since identifying a simplex $\sigma$ reduces 
to traversing MxST(K), this operation costs $O(kd\,T_{\mathrm{MxST}}) = O(kd \log n)$. 
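
The traversal can be sketched as a depth-first search that prunes a branch as soon as the next vertex of $\sigma$ can no longer appear on it. The sketch assumes that MxST is stored as a nested dictionary keyed by labels (a stand-in for the red-black trees mentioned above) and returns only the maximal cofaces, not the positions of the vertices of $\sigma$ on each path; the names are hypothetical.

```python
def build_mxst(maximal_simplices):
    """Trie of the maximal simplices, each read in increasing label order."""
    root = {}
    for simplex in maximal_simplices:
        node = root
        for label in sorted(simplex):
            node = node.setdefault(label, {})
    return root


def maximal_cofaces(root, sigma):
    """Return the maximal simplices (as sorted tuples) that contain sigma."""
    target = sorted(sigma)
    found = []

    def walk(node, path, matched):
        if not node:                               # leaf: path spells a maximal simplex
            if path and matched == len(target):
                found.append(tuple(path))
            return
        for label, child in node.items():
            if matched < len(target):
                if label > target[matched]:
                    continue                       # labels only grow: target[matched] is missed
                if label == target[matched]:
                    walk(child, path + [label], matched + 1)
                    continue
            walk(child, path + [label], matched)

    walk(root, [], 0)
    return sorted(found)


mxst = build_mxst([{1, 2, 3}, {1, 2, 4, 5}, {3, 6}])
print(maximal_cofaces(mxst, {1, 2}), maximal_cofaces(mxst, {3}))
```

A simplex that is not in the complex, such as {2, 6} here, yields an empty list.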

B.2 Insertion 

Let $\sigma$ be a simplex not in K and let K' be the complex obtained by adding $\sigma$ and its subsimplices to 
K. Observe that $\sigma$ is necessarily maximal in K' and write $d_\sigma$ for the dimension of $\sigma$. We describe 
how to insert $\sigma$ in MxST. We first check if there exist maximal simplices in MxST(K) that are 
contained in $\sigma$. This can be done in $O(k d_\sigma T_{\mathrm{MxST}})$ time, by looking at the MxST truncated to depth 
$d_\sigma$. If such simplices exist, we will need to delete them before inserting $\sigma$, which takes time at most 
$O(k d_\sigma T_{\mathrm{MxST}})$ (see analysis of step 3 of Algorithm 2). Then, we insert $\sigma$ in MxST. This takes at most 
$O(d_\sigma T_{\mathrm{MxST}})$ time, which is significantly better than the time taken for the Simplex Tree, which 
needs $O(2^{d_\sigma} T_{\mathrm{MxST}})$ time. We conclude that the time for inserting $\sigma$ is $O(k d_\sigma T_{\mathrm{MxST}}) = O(k d_\sigma \log n)$. 

B.3 Removing a Face 


Given a simplex $\sigma$, we have to remove it (and its cofaces) from the MxST. We can perform the 
operation of removing simplices (the simplex and its cofaces) as described in Algorithm 2. 


Algorithm 2 Removing simplices in Maximal Simplex Tree 

Input: A Maximal Simplex Tree and a simplex $\sigma$. 

Output: A Maximal Simplex Tree. 

1: Compute $L(\sigma)$. 
2: for each $\tau \in L(\sigma)$ do 
3:     Remove the branch ending at $\tau$. 
4:     for every vertex $v$ in $\sigma$ insert $\tau \setminus \{v\}$. 
5: end for 


Step 1 was shown earlier to take $O(kd\,T_{\mathrm{MxST}})$ time. Since $\tau$ is maximal, it is associated to 
a leaf in the tree. Let $P(\tau)$ be the path from the root to the leaf associated to $\tau$. For removing 
a maximal simplex $\tau$, one needs to locate on $P(\tau)$ the last node $w$ whose out-degree 
is strictly more than one (it corresponds to the last node which is shared with another 
maximal simplex). Then, one just has to delete the edge on the corresponding path going from 
$w$. So removing one maximal simplex (at step 3) takes time $O(d\,T_{\mathrm{MxST}})$ (since we might have to 
potentially delete up to d nodes). At step 4, $d_\sigma$ facets of $\tau$ need to be inserted and each insertion 
was shown earlier to be doable in time $O(T_{\mathrm{MxST}}\,d_\tau)$ (since we do not have to check if there exist 
maximal simplices in MxST(K) that are contained in the facets of $\tau$). The total algorithm costs time 
$O(k_\sigma d\,T_{\mathrm{MxST}} + k_\sigma T_{\mathrm{MxST}}\,d_\tau) = O(kd \log n)$. 
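
The insertion of Section B.2 and the removal of Algorithm 2 can be sketched on the set of maximal simplices that the MxST stores. The sketch abstracts the trie away entirely, so it illustrates the logic and not the running times above, and it adds an explicit maximality filter for the reinserted facets, which is a precaution taken here rather than a step of the listing. Function names are hypothetical.

```python
def insert_simplex(maximal, sigma):
    """Add sigma (and its faces); drop maximal simplices that become faces of sigma."""
    sigma = frozenset(sigma)
    if any(sigma <= tau for tau in maximal):
        return set(maximal)                        # sigma is already in the complex
    return {tau for tau in maximal if not tau <= sigma} | {sigma}


def remove_face(maximal, sigma):
    """Remove sigma and all its cofaces, following the steps of Algorithm 2."""
    sigma = frozenset(sigma)
    cofaces = {tau for tau in maximal if sigma <= tau}           # step 1: L(sigma)
    kept = set(maximal) - cofaces                                # step 3: remove branches
    facets = {tau - {v} for tau in cofaces for v in sigma}       # step 4: tau \ {v}
    facets.discard(frozenset())
    pool = kept | facets
    return {tau for tau in pool if not any(tau < other for other in pool)}


K = {frozenset({1, 2, 3}), frozenset({3, 4})}
print(remove_face(K, {1, 2}))
print(insert_simplex(K, {1, 2, 3, 4}))
```

Removing the edge {1, 2} from the triangle leaves its two other edges as maximal simplices, while inserting the tetrahedron {1, 2, 3, 4} absorbs both previous maximal simplices.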

B.4 Elementary Collapse 

An elementary collapse consists of removing both simplices of a free pair. It preserves the homotopy 
type of the complex. We break down the computation of an elementary collapse into 3 steps, which 
are described in Algorithm 3. 


Algorithm 3 Elementary collapse in Maximal Simplex Tree 

Input: A Maximal Simplex Tree and a pair of simplices $(\tau, \sigma)$. 

Output: A Maximal Simplex Tree. 

1: Check if $(\tau, \sigma)$ is a free pair. 
2: Delete $\tau$. 
3: Insert the facets of $\tau$ which are different from $\sigma$, and are now maximal. 
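
The three steps can be sketched on the set of maximal simplices stored in the MxST. The sketch assumes the usual notion of a free pair, which is not restated in this appendix: $\sigma$ is a facet of $\tau$ and $\tau$ is the only maximal simplex containing $\sigma$. The trie is abstracted away and the function name is hypothetical.

```python
def elementary_collapse(maximal, tau, sigma):
    """Collapse the free pair (tau, sigma) and return the new set of maximal simplices."""
    tau, sigma = frozenset(tau), frozenset(sigma)
    cofaces = [m for m in maximal if sigma <= m]
    # Step 1: sigma must be a facet of tau, and tau its only maximal coface.
    if not (cofaces == [tau] and sigma < tau and len(tau) == len(sigma) + 1):
        raise ValueError("not a free pair")
    kept = set(maximal) - {tau}                                   # step 2: delete tau
    facets = {tau - {v} for v in tau} - {sigma, frozenset()}
    # Step 3: reinsert the other facets of tau that are now maximal.
    return kept | {f for f in facets if not any(f <= m for m in kept)}


K = {frozenset({1, 2, 3}), frozenset({3, 4})}
print(elementary_collapse(K, {1, 2, 3}, {1, 2}))
```

Collapsing the triangle through its free edge {1, 2} leaves the edges {1, 3} and {2, 3} next to {3, 4}; a pair such as ({1, 2, 3}, {3}) is rejected because the vertex 3 also lies in {3, 4}.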


It easily follows from the above results that Step 1 takes $O(kd \log n)$ time. As for the removing 
operation, step 2 is feasible in time $O(d\,T_{\mathrm{MxST}})$ (see analysis of step 3 of Algorithm 2). For step 
3, we have $d_\tau$ facets to check if they are now maximal and then insert the ones that are indeed 
maximal. This can be done in time $O(k d\, d_\sigma \log n)$. 

B.5 Edge Contraction 


Edge contraction is another operation on simplicial complexes that preserves the homotopy type 
under certain conditions and which can be implemented on the MxST using the above operations. We 
break down the computation of an edge contraction into 3 steps, which are described in Algorithm 4. 


Algorithm 4 Edge contraction in Maximal Simplex Tree 

Input: A Maximal Simplex Tree and a pair of vertices $(u, v)$. 

Output: A Maximal Simplex Tree. 

1: Replace the label $u$ by the label $v$ each time $u$ appears in the MxST. 
2: Store the list $L$ of all the maximal simplices (lexicographically). 
3: Build the MxST from $L$. 
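
On the level of maximal simplices, Algorithm 4 amounts to a relabeling followed by a maximality filter. The sketch below works on a set of frozensets rather than on the trie, and it uses a quadratic pairwise comparison for the filter, which is a simplification chosen here for brevity; the function name is hypothetical.

```python
def contract_edge(maximal, u, v):
    """Contract u into v and return the maximal simplices of the new complex."""
    # Step 1: replace the label u by v everywhere (duplicates collapse in the set).
    relabeled = {frozenset(v if x == u else x for x in tau) for tau in maximal}
    # Step 2: keep only the simplices that are still maximal.
    return {tau for tau in relabeled
            if not any(tau < other for other in relabeled)}


K = {frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({1, 4})}
print(contract_edge(K, 4, 1))
```

Here contracting 4 into 1 maps {2, 3, 4} onto {1, 2, 3} and turns the edge {1, 4} into the vertex {1}, which is no longer maximal, so a single triangle remains.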


Steps 1 and 3 can be done in time $O(kd \log n)$ (this follows from our previous results). However, to 
perform step 2, we need to remove the simplices which were previously maximal, but are no longer 
maximal after the edge contraction. This can be done in time $O(kd \cdot k')$ [Y92], where $k'$ is the 
number of maximal simplices after the edge contraction. So an edge contraction can be performed 
in time $O(kd \log n + kdk') = O(kd(k + \log n))$. Finally, we note that the above algorithm is a 
multiplicative factor of $k/\log n$ in the worst case away from the lower bound for MxST. For any 
k, d, let us consider the simplicial complex in Example 6. 

Example 6. Let $K \in \mathcal{K}(kd/4 + k/2 + d + 1, k, d, m)$ be defined by the union of the sets of maximal 
simplices given below: 

1. $\{1, 2, \ldots, d, d + i\}$ for $1 \le i \le k/2$. 

2. $\{i \cdot d + 1 + k/2,\ i \cdot d + 2 + k/2,\ \ldots,\ i \cdot d + k/2 + d\}$ for $1 \le i \le k/4$. 

3. $\{i \cdot d + 1 + k/2,\ i \cdot d + 2 + k/2,\ \ldots,\ i \cdot d + k/2 + d - 1,\ kd/4 + k/2 + d + 1\}$ for $1 \le i \le k/4$. 

4. $\{1,\ kd/4 + k/2 + d + 1\}$. 

If the vertex $kd/4 + k/2 + d + 1$ is contracted to 1 then the new simplicial complex is generated 
by the union of the sets of maximal simplices given below: 

1. $\{1, 2, \ldots, d, d + i\}$ for $1 \le i \le k/2$. 

2. $\{i \cdot d + 1 + k/2,\ i \cdot d + 2 + k/2,\ \ldots,\ i \cdot d + k/2 + d\}$ for $1 \le i \le k/4$. 

3. $\{1,\ i \cdot d + 1 + k/2,\ i \cdot d + 2 + k/2,\ \ldots,\ i \cdot d + k/2 + d - 1\}$ for $1 \le i \le k/4$. 

The new MxST has at least $k(d - 1)/4$ new nodes (of size $\log n$) that need to be added. Thus, 
we have a lower bound of $\Omega(kd \log n)$ on the operation of edge contraction for MxST. 